which are essential for the sun-ival nf L I • 2,515 £Bnes 
fined condition or whiS arl™™ fofE^ de " 



* Corresponding author. Mailing address- Germ™ 






pathway %L be t^STo^Wf 8 ° 
activity, spectrum, functional^ iSS^^" 

CURRENT RESOURCES FOR GENOMIC SEQUENCE
AND FUNCTIONALITY INFORMATION

certhle over the Internet ^vSt W-k 

atPi&n. Mary also permit S&^T n I^T 

variety o?Xr KSfSWt <! Sa , i?J? 1 ^ * 
updated regularlv bv The / ^! on of thls table * 

•itmJ). These databases are ^ St b/mdb/mdb 
of the genonucseouences of rhn/. ™ «iensive analysis 

codor, usage to idWte'^aK? 1 * 
sent transcribed genes Putarivp u ^ ^ 10 re P rB * 

to slightly more ffitaH^S^J** ^f* aSSigBed - 
based on sequence cSJS2L*S Kn^f^ "S^™ 
in other organisms, sharedTeqne^f »o «* 

manner ^ iSf ? 3,1 and uWfui 

These databases generally ak^ JL£ ^ * nbscribe ^ 
• not available in iSSa^J^ SeqUe ? Ce Wwnutiaa 
analysis tDofcSSSfflSg »g 
the results of prerun ^woTSKISSS 8 ^' 
stored to provide rapid answers toTomnUv ^ • ■ 11107 be 
mic queries by a "SkS^fSS^K"^ E * n °- 

served in multiple pathogenic !& 1 n ^ f CM " 
genes found onlv in the mt5vi™J ? • re "gnJtion of 
Antimicrob. Agents Chemother.

TABLE 1. Targets of widely used antibiotics

Target category and gene product            Antibiotic class

Protein synthesis
  30S ribosomal subunit                     Aminoglycosides, tetracyclines
  50S ribosomal subunit                     Macrolides, chloramphenicol
  tRNA synthetase                           Mupirocin
  Elongation factor G                       Fusidic acid

Nucleic acid synthesis
  DNA gyrase A subunit; topoisomerase IV    Quinolones
  DNA gyrase B subunit                      Novobiocin
  RNA polymerase beta subunit               Rifampin
  DNA                                       Metronidazole

Cell wall peptidoglycan synthesis
  Transpeptidases                           Beta-lactams
  D-Ala-D-Ala ligase substrate              Glycopeptides

Antimetabolites
  Dihydrofolate reductase                   Trimethoprim
  Dihydropteroate synthesis                 Sulfonamides

Fatty acid synthesis



COMPARATIVE GENOMICS TO ASSESS THE
SPECTRUM AND SELECTIVITY OF TARGETS

P 0 ^^ ™ e of genomic sequence information is to 
compare all of the identified genes in different bacterid arh£ 

dete T? ft ^E^« e! or are SKff£ 
all species. Indeed, Tatusov et al. (50) have suggested that
gene families conserved among bacteria but missing in
eukaryotes comprise a pool of potential targets for
broad-spectrum antibiotics. A related approach
was taken by Mushegian and Koonin (36), who identified
256 genes shared by the two completely sequenced bacterial
genomes at that time, those of Haemophilus influenzae and
Mycoplasma genitalium. On the other hand,

r«tw ° ^ genes c onraon to most microbial 

» t S y -, uni ^ e 10 a Particular species. ForextrnplT 
AjigDnj et al. (6J identified 26 genes B / fflost 

H. pylon, Soytonoccus pneumoniae, and Borate bun&Shi 

predictable function, contained novel targets for broafspe ° 



ttrum antibiotic development. These analyses can be extended' 
by mdudmg sequence comparisons to eutoyo^gen^ J 
means to examme potential selectivftyTa targe: SiFb 
SS 6 ^ 0 " 1 /' * ( ? that 15 ofVproteS 

rfteS »m ^oss bacterial species also exhibited S 

toerore, represented, targets which, in an assay, night ide£ 

SS^Sf ^ to """^ ■* -honld be noted that te 
' 5 Vffl l ar mMketed ^microbial agents show 

some conservation with rnamaalian proteins 

As m all sequence comparisons, the search parameters and 
ice quality of the input data, e.g., partial human or mammalian 
K^^f 2 ^' m Relev =" issues which mua 

degree of sequence amilarhy to another bacterialeenonit 

■ f ™ B WanUi ^ a P°^ taddry problem? Shies 

SSKhS^ F^ 8 ^ ^orithms aBow nearry cont 
pleteflejdbfliiym me choice of these parameters some knoZ 
examples are necessary to calibrate ^teSSoVSSS 
and ICoonm (36) used a BLASTP score of SO as th™fffe 
defining a biologically relevant relationship beween^o prl 
™„ i£ ne6 * 716 fl PP ro - Driafe ^tofi score for otaffi 

■ SST 1 ? 81,13816111 y""" 1 1 ^ beSSS 

specific. Some examples reveal a general trend TriinZS 

toes pHFR) despite the fact that the JbSJS?^ 
DHHl gene products share 28% amiuo acid identity over the 

percentages translate into BLASTP scores of 1 33 Jin 
^1 respectively, ina search tf^JSlSSi pr'oS 

and PIR. Thereiore •exclusion of genes having apparenTS 
maJm homologs whh scores >250 would liSy beSb™ 
a search : of bactenal targets, but the score cutoff woSdbave a 



IDENTIFICATION OF ESSENTIAL TARGETS
EXPERIMENTALLY

Genomic sequence information is not required for discovering
essential genes, but such information does facilitate the



[Figure: stages of genomics-based target discovery and the methods
associated with each. Comparative genomics (selectivity and spectrum);
functional features (sequence similarity, motif content, protein
interaction, operon organization, structural similarity); validation of
essentiality (directed allelic exchange, random transposon insertion,
genetic footprinting, GAMBIT, STM, conditional lethals, VITA);
validation of expression (micro-array hybridization, RT-PCR, IVET);
assay and HTS (biochemical assays, whole-cell assays, ligand-affinity
assays).]



an d gram jjegative, 
Internet resource
ww.tigr.Drg/idh/mdb/bidb/hjclbJitmJ 

ww.ngr.Drg/tdb/mdb/mjdb/mjdbJiimJ 
ww^axusa^r.jp/ryaao/cy^ojj^ 

wv^fab^i^Melber^c/M^Bumoniiie/ 

^tym^s^ohena^ 
j&enomejmnlr or genome-wvw^tiafbrd 
.Bflu/S&cch&ro rapes 

, f^.ti^.Drg^b/mrib/hpdb/bpdbJjtiaJ 
www.genetics.wisc.rtiu/ 

www.gcnomncoTp^Dtn/gciic/sequencw/' 



— — *j*ifc.MS4/uae!iHGiJ] toil 

www.pasteurJr/Bb/SubtnJsUrtaii 

ww.dgr.or^tdfa/iniib/aabMfrlbitaiJ 

www.tigr.orE/&db/mdbAbdWbbdb.htmi 

WWW JCbUlJjIUllb,gDV/cp>faln/F.n fr^r/ 

fraioik7db-»Genomn&gj«i33 
www.hio.Dite.go,jp/ot3db i .iadcx.hniiJ 

www^mgfir^c.ul£/PrDj M EVW tuberculosis/ 
www.ngr^rg/tdb/ffidb/Q3db/tpdh.htiiil 

www.gonomccDrp.DDm/bpyiori or wmutrfr. 
oostDzucom/bpyiori 

www.tjgr.org/cig-biii/BIjiBtScBrcb/bifi5t 
,cgi7 organism -m^tnbcrcuj aw 
TABLE 2. Sequenced microbial genomes



Genome 



Strain(s)



Haemophilus influenzae Rd KW20
Mycoplasma genitalium G-37
Methanococcus jannaschii DSM 2661
Synechocystis sp. PCC 6803

Mycoplasma pneumoniae M129

Saccharomyces cerevisiae S288C



26695
K-12
delta H



Escherichia coli
Methanobacterium thermoautotrophicum
Bacillus subtilis
Archaeoglobus fulgidus
Borrelia burgdorferi
Aquifex aeolicus

Pyrococcus horikoshii

Mycobacterium tuberculosis
Treponema pallidum
Chlamydia trachomatis

Rickettsia prowazekii
Helicobacter pylori

Mycobacterium tuberculosis



168

VC-16, DSM4304

B31

VF5

OT3

H37Rv
Nichols

Serovar D (D/UW-3/Cx)

Madrid E
J99

CSU#93



Size 
(Mb) 



Institution(s)



Reference 



13 

25 

B 

27 

23 



1J3 

djb TraR 

1.66 UGR 

3S7 KnrusD DNA Research Institute 

DJ31 University of Heidelberg 

13 European and North American J7 
Consorthun 



1.66 TtGR 

Unireniry of Wisconsin 
1*75 Genome Therapeutics and Onio 

Ziffle Unrveniry 
4.2 inttrnntioniiJ Consortium 
2.1B TIOR 
1.44 TIOR 
U5 Diversa 



51 

7 

43 

31 
29 
14 
ID 



25 



1.SD National Institute of Technology 

nod Bvaluanoa 
4.40 Sanger Centre 9 
1J4 TIGR end Unhrcrsiry of Texas 16 
1-05 University of California at Berice- 46 

ley and StarJorrf University 
Ul University of Uppsala 4 
1.64 Genome Therapeutics and 3 

Astra AB 
4*40 TTGR 



Unpublished 



method. can be M ? 0 d&fr SME^SE? 18 
m £ eofl, genes on be placed ujjder^of of e r^l^f P " 

ceni (11), or genes can be mutated tb a eonriirirmDi iJIu 77 ™ 
In principle, such oandtm^^^S^^^ 
cell screens under moderarJv ^Zt, b& used m whole- 

which are essentia] under the Si rf™ m ?~ gen j? 
auxotrophic mutations ni ay fcS s^ett n^H* 

nutrient effineady. In order to estabteh that aSSSt h 
.ssenbal ,n an infection a transposoa-based IX?Ej 



mooei (22, 35). However, Bince cellB canyine the dismnrTn 
tagged genes must be sown in fh* i 0 utZr g oisrupted 

a^J 0 W f Batu « suitable antinaicrobia] E e ne 
miormarion about p Mf Jt™„ K methods ofier 

used to detect tm E mp E TS ™^ ( E" PCR i may bt 
animal infection modi" ffflACZ^ mBcfaa ° r k 
which have been -qiWiJalJS?: 5? i 015 "^ 

^ approach, which has been developed fofuLS 
;™ na2c ftpteuowm grown intniperitonealh- fa aSaS 

recovered and cloned. ^ «oS„iS rfS?^ 10 TOD> 
used for selection sncodeE a mSdffi. i ™ « W 66,16 

IDENTIFICATION OF ESSENTIAL TARGETS
USING DATABASES

pathogens ^well as M bS^^SSSSa? 

work* ViPtt^r « _T • s e "=is;. Second, the method 
r° 211 ? rolu sionaiy tool than as as indiaiona™ 

SEl " I 0 any one of ffiultil,lt causes jncludinVoolar 
ghejMK mefficent recombination ta a p^S?^ 






Database or oignnizatiou 



TABLE 3. Additional Internet resources 



Sequence databases 
NCBL« 



DDBJ 



Internet itddrws 



EBI/EMBI 
GSDB 



SwtssProt (Geneva}.. 



Candida. 

MIPS 

HDP 

SGD 



Metabolic databases 
fCEGG„ 



WIT 



Sequci} dug groups 
Berkelay - 

Genomt Therapeuiics . 
Sanger 

Slanfpi'tl - M | 



Gaaome/orgJinnl 
^^/www.ddbunig < Q CL jp/i 3 teii£ tast/ 

Weicomt-tJitniJ 
— bnpyAvww.ebLac.ulc/ebi home-btml 
^hj^VAvww.ncgr.orB/^db/Index jsdbJitmJ 

-.htipy/aictt^duno.etia/CaadidiLhtmJ 
^t03^«nvw.mips t biocbfim-inpg.dD/ 
-Jittp^/rdp JIfemhic.edu/ 
«hrtp^enome-www^tanfDro\o do/ 

«Jitip^/Hrww.5eiminfi.Bd.jp/lcfigg/ 
-*ilttp^/ecocva t ?anganSygte mgRn m / 

e coeyc/eco cycJi GuJ 
^tn3^/wiw^mtjnsu.£dy/wn7 



^^^/cWainyriii>-wurpv.b^ke ley.edu ^231/ 

^vVsnquenc^vw^taiiiordedii/eroup/ 

Tjqj^ jnaJaria/indexbtml 

Washington University „ Ju^Xb^wI^**™** 

fialmoneJiaJitmJ 

Took and resources 

Biomoinukr tort Toot htW/ www .p UbBaiasnifc . edll/ _ pedro/ • 

qqq s rt^LJjtmJ 

Nrn p bt^^/w^.ncbLniinjiib^ov/COG/ 

— bttp^/www.nc^r.Drg/inirrobc/Index bus 
.html 

~'^ttp://www.mrxani.gov/b orri e/gaastcrI/ 

genomes J] tzai 
— brta^/specter.Dc-Li2ib.gDvrB004/ 
■^^AwJsuine.tdu/ca^^ 
pnbUcJjtmJ^exJiixQl " 
— 'btrpy/www.wdcnmjcen.go .Jp/ 
— ix^/wWiWhaeh/WdcomeJinnJ 
— btrp^/vAvw.minvj]c.ui^-rh. brnDDl/kctboc-Jt/ 
chapter.btmJ 
.Jiapyywww.cdc.gov/ 



MAGPIE., 
Genobase 



Micro Underground...™.. 



ANMR. 
WHO. 



CDC* 



University of Kansas.. 



Univendty of Georgia. 

TripDi _ , 

Motif Z 

Pedant- 



^^Ww^umcxdu/resc areh/Sgscteata 



GeaTHRHADER.7 



^ttp^/aingus.gEneiin.uga. eduSOBty 
— -bttpV/www.rripD5-comy'siie>s .htmJ 
«±ttp^/dna3toaford^U/iaentiry/ 
— htnj^pedaDLniipsibiDcbern..nipg.dfi/ 
— bt^/gJobiabio.wnrwicLa&uIc/EeDonie/ 
genomicfatmi ■ 



One solution to this problem is to carry out ail-] e arcw^ 
as s two-step procesB (20 22) In P , exdiange 

strongly sugg^ thai B ud» prog^ J t^ t 

£ndmg cell, r^ing only ^ wfl J. type .n^*^^ 
White this complies^ ^SonTe^r 1 ^ 
organisms, it provides . JZ? . "senna! genes m these 
genes under cS5 fa wSt " Inethod fM « a »™Ptt* 
the resulting strS ™ b T » *" 

Anew approach promises to anvimr. «. 
wting the essentiaUry ™ ln eE S ^P 1 " 0 ^ of eyaj. 
Scribed a methodforK^ f™* " t . aL (44 ' 45 > bav * 
footprint^ wUdTaSiSSfao tSST"? 
Dt element to geaers^ !3^" rMd ™ ^Posabie 
..population of cells. Furtter^Sv ka 0ckD1Ite ^ a 
population is thtn g^Tm^TZ^Y *< Md 
.prepared from «CSS° f ""Woa, DNA is 
DNA is queried by ^ Papulations, and the 

. yield PCR proton tojfff° at,nn t0 detennine * » «® 
transposon-speca o^IT™ ^ene-specific primer and a 

able under the growtt co^dSf ^ ™ ? £, BBae wers im*- 
products are v& D ^SS^ , ^ HnnM ^ I, GR 
automated fiuoreTcence sequencing gefc ^ ^ 

method is the existence of Sin" 01 ™' C „° ntro1 in this 
. in the' waned i n cell nm..u ft l °- t ™ a P««i PCR product 
transposition. ^iKi 10 *' shntdZ of 
not simply E «coJd» sWfrfr — W9>ei ? mMte ^ that this regjon is 
method temSL^L^T*- ^ ****** ° f thfa 
.aD necessary gene toockouft S 0m ,5 ans P Dsons to b uild 
^■»d^ JS^iJSiJ?""? * ^tomated 
gene of interest nnerprer the results for any given 

B^haTbeerTapp^d^^ 

In this variation £ ? ^ bacter »J species (1). 

mgenesis was done S pg!2^ ^ position mi 
A or J.Zr°* gMomSc "gnents from 

introduced fa* ttoHK? c^' ^ ™« 

a true , 0 , the SSfflM? *™» ^ * 

. saturation mutagenesis with thr Be 5 nea£B Permits near- 
«wJ. which shots inl^^ 
authors identified CST 1 f SpeCifici * ^ 
function from a totalis™ a?* SBne? 0f 

mutations generated b SKSE? 6 baCteriuffl 80 that 
Other limitations whicn St^aS eelSf^ b *»■ 

ods include the foIlowmTm .* SeMticfDotprintingmetb- 
gene that is dunSK & f <* • 

analyzed, since footpri^f^ef J™ 01 be 

mutagenized gene- m • ^f 8868 ™ e fitness of a sinek 

from bacteria; m ^^T ap ?'^ baa of ^obtained 
gene knockout £ £ STSS ^, ^ fDD ? rinI ^£ data with 
and (iv) footprinting d^ Ll * m ** 0I &^ 

for a variety «*J^^£gff}£^» 
taal genes will- tolerate insertion! in th e p ? essM " 
«gion (e.g., nn b «h ~VtT C-termmal coding 

genes display ^^^^ f 8 ^^ * some 
«k2 [44]). mle .™ed]at e slow-growth phenotype (e.g., 
TOOLS FOR PREDICTING THE FUNCTION

example, the function ofthe J£L T? lt 'J* s ™ e ^ . 
been predicted based m telS^S*? 1 F*** its * 
Product of ^^^^3^"^* 8 gene 
edness to a protein of confirm.* * J^ eB ' ^ of relat- 
About half of&e «n« Sl^!^ 011 ffia ? be ev « toagBt. 

agnment or have likely homolog? ^h^K *™ e *«M«l as- 
In neither of these cria bTb r" 1010 ?. 18 u =toown. 

gene product. Nevertheless^ St f be &r *e 

searches are a useful bS^^^JT^^^ 
More sensitive sequence confnS,,' fu ? h6r ^"stigation. 
putative function orfunrtiend iSZZZ??* ? ay P rovfde 8 
a short protein sequencemoS 25 P 1 * 8 ^ "f 

a database of clusters of S( ^P^ 8 8ea « b against 
[Table 3D ^ellToveflS? ad te a f^ s K5 

a ^^s-Br^^*^^ 1 - 

sample, ageaeproducL^tSf.mm^ V"" producL F » ' 
ship to a pxote&of K W«ff Cant8e ? ueDM «Wob- 
eotranscribed i7p« «Sf? SSSr^ ^ ta ^ t0 bft 
genes f0 f known ruLtioS .lfi$S^£E*° ^ otb « 
wnh the known gene products rJ t? le ™ ^ same pathway 
hypothetical paJJ^SSS'^^ * C ° U Sraon,e - ^ 
porphyrin biosvnthf& gen? ^/,° 1 . : ™ Bclibed the 

VdM appears to be ^ o2 wS S E=ue 
usher protein HtrE which h ^™ ^ OT ter-membrans 
? is reasonable^' SSKSrl" ^P 0 " 3Bd " 
function play roles S?!^ 8 geaes oi 
^morhe^l^^^^^P^^ as their 

would be required to coS ^ese M 6 "^ 31 eviden « 
east for identifying lOri^SuSfeX"- ^ b °* ^ 
sence of strong primarv Ln^rT • 8 ™ 1 . a ' 3I 3' even in the ab- 

of known stStm^ow 9 to wK^' * the databas " 
PTDach for assigning K wTf COm ' 1 a P"«rtil ap- 

anatysis results fromTfaS! re? 0En T fTaUe 3 ^ P^ eat8 
predicted open reading S& a ? pragTaa 011 

Laboratory iaemod7ean^ D T^ three b acteria] gtn 
of unknowngene ideSftS u^Z^ 10 S ° lve 
the bait in ayeasr twobySc ^^^^ be ^ «« 
^hose protein produce fficfS^i^ l f ident ^ S enes 
The identity of an mteracS^?^^^. 
the unknown in a pardcijKS rw £ frs qa 6 n.tly implicate 

^owngenemaybTeSr^^^^ Fina ^y. sn 
purified bf affinity ■ cSS ^ »ri t& the prolein 

ries of activities such Sotoct^ SS?"? tested for ^ego- 
ATP or GTP hydrohsk ^ dBava ^ or ^ding, 

abffiry of ««iSM«SSS m fc tD . aflme The prol- 
the IsttermethoTisbtffi^^^ " F ™* no « ^ 
sequence comparisons suggesSe rfr«. may , bs warr8 uted if 
ated with an assavable SSL a presence of a motif associ- 

Thfi an*sv nf J 
those genes provides both gene targets whose cellular function

™ n,edlatel y » biochemical assays and 

S J bioAaiiialKMq,. Typically, the gene sequence 
is amplified by PCR from genomic DNA of a given bacterium,
inserted into an expression vector, and expressed,
sometimes with affinity tags to facilitate purification of the
resulting protein product.

It is less obvious how to proceed with gene targets lacking
functional information. This problem has attracted considerable
attention in recent years because of the growing number
of genes known to be shared across many bacterial
species (24), some of which are known to be essential.
As a general guide, about 40% of bacterial genes
cannot be assigned a putative function at this time. If 10 to
15% of these genes are essential, then 4 to 6% of the genes in
a typical bacterial genome (about 100 genes) represent novel
targets of unknown function.
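
The estimate above is simple arithmetic, and a short Python sketch makes its dependence on the inputs explicit. The 40% and 10 to 15% figures are taken from the text; the genome size of 2,000 genes is an assumption, chosen only because it is consistent with the "about 100 genes" quoted there.

```python
def unknown_essential_fraction(frac_unknown, frac_essential):
    """Fraction of all genes that are both of unknown function and essential."""
    return frac_unknown * frac_essential


FRAC_UNKNOWN = 0.40             # from the text: genes with no putative function
ESSENTIAL_RANGE = (0.10, 0.15)  # from the text: share of those that are essential
GENOME_SIZE = 2000              # assumption: illustrative gene count, not from the text

low, high = (unknown_essential_fraction(FRAC_UNKNOWN, e) for e in ESSENTIAL_RANGE)

print(f"{low:.0%} to {high:.0%} of all genes")
print(f"{round(low * GENOME_SIZE)} to {round(high * GENOME_SIZE)} genes "
      f"in a genome of {GENOME_SIZE} genes")
```

Under the assumed genome size the range brackets the figure of about 100 genes; a smaller or larger genome scales the count proportionally.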

Three basic types of approaches seem feasible and
have shown some initial success. First, cells expressing higher-
or lower-than-normal levels of particular gene products have in some
cases been shown to be more resistant or more sensitive,
respectively, to chemical compounds
known to inhibit those gene products. For example, overexpression
of the yeast ALG7 gene results in cells more resistant
to tunicamycin (38), while reduced expression of the
gene product results in cells more sensitive to the
drug. Overexpression of the ERG11 gene
leads to higher levels of resistance to the
azole family of drugs which target that enzyme (54). A gene of
unknown function could be overexpressed in a host strain, and
the resulting cells could be screened for increased resistance
to chemical compounds. It is clear, however, that many
proteins, when overexpressed, do not lead to resistance to
chemical compounds that are known to bind to the protein
product (e.g., prA [52]). Furthermore, overexpression of proteins
often leads to lethality or growth defects.
Alternatively, a gene could be underexpressed or crippled by a
mutation so that cells might show increased sensitivity to a
compound which inhibits the protein product. Scientists at
Microcide Pharmaceuticals, Inc. have applied this approach
on a large scale using temperature-sensitive mutants
SnW?^ tem P eratoes » to reduce the levKat 
whatfrachon of unknown gene products would provide the cell 

S^reSufle^r " "» " « 
The second approach to this problem of assaying gene products
of unknown function is probably more general:
identifying ligands exhibiting affinity to proteins of unknown
function. This has been achieved with peptides in phage display;
in this case binding can be readily detected by elution of bound
phage from the protein on a solid support. Proteins of unknown
function may be produced easily as affinity fusion products for
this purpose, and a variety of peptide phage
display libraries are commercially available. Conformationally
constrained peptides with affinities in the nanomolar range
(to 200 nM) can be obtained by this approach. Of
course, not all peptides detected by this approach will bind to a
site which inhibits activity, but an elegant new method, called
"Validation in vivo of targets for anti-infectives" (VITA), has
been devised to identify those peptides which inhibit essential
cellular functions (49). Potential inhibitory peptides were
expressed in a regulated manner within bacterial host cells which
were grown either on agar medium or in an animal model of
infection. Inhibition of cell growth or viability upon induction
of the peptide validated the peptide-protein interaction
as useful for further drug development. While peptides are not
ideal drug candidates, a wider array of techniques is applicable
after a moderate binder has been obtained. The peptide
can be used as a surrogate ligand in a competition assay to
identify a small organic compound with higher affinity.
Scintillation proximity assays (25) or fluorescence polarization assays
may be run in a high-throughput mode to identify compounds
in chemical libraries that compete for binding with a
labeled peptide. Alternatively, ligand binding assays may be
configured to work directly on libraries of unlabeled chemical
compounds. Shuker et al. (42) have described a nuclear magnetic
resonance-based method capable of a throughput of
1,000 compounds per day. Mass spectrometric methods are
also of interest as potentially rapid ways to detect bound
ligands from chemical libraries. One concern about these
approaches is that proteins may have multiple accessible binding
sites, many of which have nothing to do with catalytic activity.
However, it is worth noting that
Shuker et al. (42) took advantage of a second binding site to
increase the affinity of an inhibitor for the protein. Ultimately,
high-affinity ligands must be shown to inhibit cell growth
to demonstrate antimicrobial activity. Some chemical engineering
of the compound may be required to increase microbial
uptake.

The third approach for gene products of unknown
function relies on the complex gene expression regulatory network
found in many bacteria. Expression levels of genes
involved in a pathway are often regulated in response to the
amounts of intermediates in the cell. For example, disruption
of the general secretory pathway in E. coli by mutation results
in upregulation of secA gene expression (37). Alksne
et al. (2) took advantage of this fact to build a strain containing
a secA-lacZ fusion as a detectable reporter. Several
compounds were identified by
their ability to induce expression of the reporter. Many of these
exhibited antimicrobial activity and reduced the secretion of
Staphylococcus aureus toxin 1. Similarly, Mdluli et al. (34) have
reported that sublethal concentrations of isoniazid lead to
upregulation of the kasA and acpM genes. This group has
initiated screening of chemical compounds
which induce expression of a luciferase reporter fused
to a gene in this regulated pathway. Screens of this type, which
take advantage of the bacterial gene regulatory network, are
likely to be less specific than the two other types described here.
In addition, they suffer from the basic limitation of all whole-cell
screens: compounds must be capable of entering the cell in
order to be detected. However, these types of screens offer the
potential advantage of identifying compounds which act at any
of several points in a pathway.

CONCLUSIONS 



The availability of complete or nearly complete genomic sequence
information for several different bacterial species provides
important new advantages for target discovery. First, it permits use of
comparative genomic analysis to identify potential new targets
shared across several bacterial species or particular to a
single species. In this manner, it is possible to generate lists of
genes that represent potential targets for broad-spectrum or
highly focused narrow-spectrum antibiotics. Sequence comparisons
can also provide some assurance against mammalian
toxicity if proteins of similar sequence do not exist in mammalian
sequence databases. Second, sequence similarity provides
some insights into putative functions for most gene products.
Finally, availability of the entire sequence of the gene target of
interest permits rapid construction of gene knockouts to validate
the utility of the target and facile construction of expression
plasmids for production of protein and development of
assays. The fact that bacterial and fungal genes can be assessed
rapidly for their relevance as potential antibiotic targets by
determining the effect of knocking out the gene and the fact
that their genomes are small enough to be sequenced in their
entirety are compelling reasons that the field of genomics will
likely find its first real utility in the development of new
antimicrobials.
-^dispensable component oTo t ,3/ 
of a v ri of b . ol 

expected to grow exponentially for at \ sast 
the next few years, and conceivably £ 

Knowing the inventory of enn^J J 

^dersta^in, ltfe itself , ^ ™I» 
mdispensable for achieving this goaT beO 

tejn implicated ,n an essential functionT 
nodded m a given genome. Acco XgQ 

iy, an alternative protein for th<* 1 
Wion should be^ouit^SS 

SSTe?or d ^ ^ < 5 ^ *5r 
muinpie genome sequences, it is possible to 

delineate protein families that are hi B Hv 
curved in one domain tf aTlSK 

may be crmcally important: For exampk 
Kna but are missing in eukaryotes comprise 

wrennes the problem of gene classification 

ifSS fe f ble to repIace * e 

wKil com P lete > consistent system in 
which the groups are likely to have evolved 
ot genome organization (9). ^ son 
W-fJ. I he consistency between BeTs resnW'l 
me in triangles does not depend ! 0^ 

most likely to be informative when the
BeTs forming a triangle come from widely
different lineages. Accordingly, only five
major, phylogenetically distant clades were
used as independent contributors to COGs:
Gram-negative bacteria (Escherichia coli and
H. influenzae), Gram-positive bacteria
(Mycoplasma genitalium and M. pneumoniae),
Cyanobacteria (Synechocystis sp.), Archaea
(Euryarchaeota) (Methanococcus jannaschii),
and Eukarya (Fungi) (Saccharomyces
cerevisiae) (13).

The procedure used to derive COGs included
finding all triangles formed by BeTs
between the five major clades and merging
those triangles that had a common side
until no new ones could be joined. A triangle
is an elementary, minimal COG (Fig.
1A). The groups produced by merging adjacent
triangles include orthologs from different
lineages and, in many cases, paralogs
from the same lineage (Fig. 1, B and C).
Because of the existence of paralogs, the
BeTs that form the triangles are not necessarily
symmetrical: For example, in the
COG shown in Fig. 1C, the same
M. genitalium protein, MG249, is the BeT for four
paralogous σ subunits of E. coli RNA polymerase,
but only for one of them, RpoD, is
the relationship symmetrical.
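
The triangle-and-merge procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the gene names and best-hit table are invented toy data, and treating two genes as linked when either one is the other's best hit (BeT) is an assumption about how asymmetrical BeTs enter a triangle.

```python
from itertools import combinations


def bet_edges(best_hit):
    """Undirected gene pairs linked by a best hit (BeT) in either direction."""
    return {frozenset((gene, hit)) for (gene, _clade), hit in best_hit.items()}


def is_symmetric(a, b, best_hit, clade_of):
    """True when a and b are each other's best hit in the other's clade."""
    return (best_hit.get((a, clade_of[b])) == b
            and best_hit.get((b, clade_of[a])) == a)


def find_triangles(best_hit, clade_of):
    """All triples of genes from three different clades joined pairwise by BeTs."""
    edges = bet_edges(best_hit)
    triangles = []
    for trio in combinations(sorted(clade_of), 3):
        if len({clade_of[g] for g in trio}) < 3:
            continue  # a triangle needs three distinct clades
        if all(frozenset(pair) in edges for pair in combinations(trio, 2)):
            triangles.append(frozenset(trio))
    return triangles


def merge_triangles(triangles):
    """Join triangles that share a side (two genes) until nothing more merges."""
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) >= 2:
            parent[find(i)] = find(j)

    clusters = {}
    for i, tri in enumerate(triangles):
        clusters.setdefault(find(i), set()).update(tri)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical toy data: gene -> clade, and (gene, target clade) -> best hit.
CLADE_OF = {"ec1": "gram_neg", "ec2": "gram_neg", "mg1": "gram_pos",
            "mg2": "gram_pos", "sy1": "cyano", "mj1": "archaea",
            "ye1": "eukarya", "ye2": "eukarya"}
BEST_HIT = {
    ("ec1", "gram_pos"): "mg1", ("mg1", "gram_neg"): "ec1",
    ("ec1", "cyano"): "sy1", ("sy1", "gram_neg"): "ec1",
    ("mg1", "cyano"): "sy1", ("sy1", "gram_pos"): "mg1",
    ("mj1", "gram_pos"): "mg1", ("mj1", "cyano"): "sy1",
    ("ye1", "gram_neg"): "ec1", ("ye1", "gram_pos"): "mg1",  # asymmetrical BeTs
    ("ye2", "archaea"): "mj1", ("ye2", "cyano"): "sy1",
    ("ec2", "gram_pos"): "mg2",  # an isolated hit that forms no triangle
}

cogs = merge_triangles(find_triangles(BEST_HIT, CLADE_OF))
print(cogs)
```

In the toy data the two yeast genes end up in the same cluster through different triangles, which mirrors how paralogs from one lineage are drawn into a single COG, while the isolated pair ec2 and mg2 is left out because a single BeT does not make a triangle.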

Most of the clusters derived by the above
procedure meet the definition of a COG,
that is, all of the proteins from the different
lineages in the same cluster are likely to be
orthologs. There are, however, several reasons
why, in certain cases, COGs may be
lumped together. Proteins may contain two
or more distinct regions, each of which
belongs to a different conserved family; usually
such proteins are loosely referred to as
multidomain (14). Each of the clusters was
inspected for the presence of multidomain
proteins, individual domains were isolated
(15), and a second iteration of the sequence
comparison was performed with the resulting
database of domains. Some of the COGs
may include proteins from different lineages
that are paralogs rather than orthologs,
primarily because of differential gene loss in
the major phylogenetic lineages. When one
gene in a pair of paralogs is lost in one
lineage but not in the others, two COGs
that should have been distinct may be
artificially joined. Therefore, the level of
sequence similarity between the members of
each cluster was analyzed, and clusters that
seemed to contain two or more COGs were
split.

Fig. 1. Examples of COGs. Solid lines show symmetrical
BeTs. Broken lines show asymmetrical
BeTs, with color corresponding to the species for
which the BeT is observed. Genes from the same
species are adjacent; otherwise the gene names
are positioned arbitrarily. A unique COG ID is
indicated in the upper left corner. (A) Congruent BeTs
form a triangle, the minimal COG. Origin of the
proteins: KatG, E. coli; sll1987, Synechocystis
sp.; and YKR066c, S. cerevisiae. Note that all the
BeTs are symmetrical. (B) A simple COG with two
yeast paralogs. Origin of the proteins: IleS, E. coli;
orthologs from H. influenzae and M. pneumoniae;
MG345, M. genitalium; MJ0947, M. jannaschii;
sll1362, Synechocystis sp.;
and YBL076c and YPL040c, S. cerevisiae. Note
the adjacent triangles with a common side, for
example, IleS-MG345-MJ0947 and
sll1362-MG345-MJ0947. YPL040c is the yeast
mitochondrial isoleucyl-tRNA synthetase; the bacterial
orthologs and that from M. jannaschii are the
BeTs for this yeast protein, but the reverse is true
only of the bacterial proteins (symmetrical BeTs).
Conversely, for YBL076c, which is the yeast
cytoplasmic isoleucyl-tRNA synthetase, the
M. jannaschii ortholog is a symmetrical BeT, whereas the
bacterial orthologs are asymmetrical BeTs. (C) A complex
COG with multiple paralogs. Origin of the proteins:
RpoH, RpoS, RpoD, and FliA, E. coli; HIN1403
and HIN1655, H. influenzae; MG249,
M. genitalium; MP485, M. pneumoniae; sll0184, sll0306,
slr0653, sll1689, sll2012, and slr1564,
Synechocystis sp. RpoD, HIN1655, slr0653, and MG249
are major sigma factors (σ-70), whose function is
universal in bacteria; note the fully symmetrical
relationships between these proteins. The other
proteins are specialized sigma factors whose
radiation from the ancestral family apparently was
accompanied by modification of the function and
involved accelerated evolution; note the
asymmetrical BeTs.

Phylogenetic and Functional 
Patterns in COGs 

The described analysis resulted in 710 apparent
COGs. This set appears to be essentially
complete as far as orthologous relationships
are concerned. Indeed, when the
portion of the database of proteins from
complete genomes not included in the
COGs was clustered by sequence similarity
(16), only 10 groups were identified, which,
upon careful inspection of the alignments,
were considered likely to constitute additional
COGs missed originally. These
groups were incorporated, producing the final
collection of 720 COGs, including 6814
proteins and distinct domains of multidomain
proteins (6646 distinct gene products,
or 37% of the total number of genes in the
seven complete genomes) (17).

Most of the COGs are relatively small
groups of proteins. One-third of the COGs
(240 COGs with 1406 proteins) contain
one representative of each of the included
species (no paralogs), and 192 more COGs
include paralogs from only one species,
most frequently yeast (87 COGs). The
mean number of proteins per COG increases
with increasing number of genes in a
genome, from 1.2 for M. genitalium to 2.9
for yeast. A notable aspect of many COGs is
the differential behavior of paralogs. It is
typical that one of the paralogs, for example,
in yeast, shows consistently higher similarity
to the orthologs in all or most of the
other species (Fig. 1, B and C). For numerous
yeast paralogs, particularly components
of the translation apparatus, the underlying
cause is obvious: the gene whose product is
most similar to the bacterial orthologs is of
mitochondrial origin (Fig. 1B). A more
common explanation for the asymmetry of
the relationships in the COGs, however, is
that the highly conserved paralog has retained
the original function, whereas the
functions of the less conserved paralogs
have changed in the course of evolution. In
the already considered example (Fig. 1C),
the symmetrical component of the graph
(solid lines) delineates the conserved function
of the σ-70 subunit of the RNA polymerase
(E. coli RpoD), which is required for
the transcription of the bulk of bacterial
genes, whereas the asymmetrical BeTs (broken
lines) are observed for σ subunits
(E. coli RpoH, RpoS, and FliA) involved in the
transcription of specialized gene subsets
(18). This phenomenon appears to be
widespread, as we found 549 proteins in 302




COGs whose corresponding paraloK 

er members of the COG. One may think of 
Je rapidly evolving paralogs „ p™ 0 ° f 

ciT 5"** from ^Tthin S 

conserved ones. The COGs will be an important
resource in a systematic survey of
the functional diversification of paralogs in
conserved gene families.

There are several large clusters in the
current collection with complex relationships
between members. Two of these,
namely the adenosine triphosphatase (ATPase)
components of ABC transporters and
histidine kinases, each include over 50
members. It is likely that subsequent detailed
analysis of these large groups (for
example, by phylogenetic tree methods)
will result in their split into several distinct
COGs. On a more general note, COGs
do not supplant traditional methods of
phylogenetic analysis but rather provide the
appropriate starting material for these
methods, in particular for a systematic analysis
of phylogenetic tree topology.
[...] breakdown of the COGs, the protein function is either known from
direct experiments, mainly in E. coli or [...] basis of significant
sequence similarity to functionally characterized proteins from [...]
species. [...] that construction of the COGs includes automatic
prediction of the function for numerous genes, particularly from the
poorly characterized [...] substantial fraction of the COGs for which
only general functional prediction, typically of biochemical [...]
actual cellular [...] could be made, and for another 5%, there was no
functional clue (Fig. 3). Each of the COGs includes proteins from at
least three major clades whose divergence time is estimated to be over
a billion years (21), that is, they all are ancient, conserved
families [...] essential, cellular functions. Therefore, the proteins
belonging to the "mysterious" COGs are [...] for directed [...].
The distribution of proteins from different species in the COGs shows
several trends (Fig. 2), although the bias in the current collection
of complete genomes (in particular because three lineages are required
to form a COG, all COGs had to have a bacterial member) must be taken
[...]. The fraction of proteins belonging to COGs is greatest in the
nearly minimal genomes of mycoplasmas (70% for M. genitalium) and much
lower in the larger genomes of E. coli and yeast (40% and 26%,
respectively), which indeed is the tendency [...] functions. The gene
[...] of pathogenic bacteria (H. influenzae and two mycoplasmas) are
essentially subsets of the two larger [...] gene complements of
E. coli and [...]. The main cause of [...] is likely to be the [...]
species from different major clades. Accordingly, the fact that
proteins from the pathogenic bacteria are missing in many COGs most
likely testifies to gene loss, which [...].

[...] conserved [...] in a COG with E. coli or Synechocystis is
measurably more frequent than [...] with yeast (Fig. 2). Such a
distribution of the archaeal genes appears to be due primarily [...]
(10), although the mentioned bias in the genome collection is also a
factor.
The phylogenetic distribution of the [...] distinct for different
functional [...] (Fig. 2). It is not unexpected that translation is
the only category in which ubiquitous COGs are predominant. Another
obvious trend is the absence of proteins [...] (pathogenic bacteria
and, particularly, the mycoplasmas) in many COGs in each functional
category other than translation [...] and especially in the metabolic
functional classes. Conversely, the congruence between the two
nonparasitic bacteria, E. coli and Synechocystis sp., holds for all
functional classes (Fig. 2). Also apparent is the differential
appearance of archaeal proteins that tend to group with yeast proteins
in the translation and transcription classes (which, given the bias in
the genome collection, results in ubiquitous COGs) but [...] classes
are frequently found in COGs with bacterial proteins only.

Fig. 2 (caption fragment). [...] belong to the particular COG. [...]
(numerator) and the number of proteins (denominator) is indicated for
each [...] functional categories [...].

The phylogenetic distribution of COG [...] be [...] presented in terms
of "phylogenetic patterns," which show [...] species (Fig. 3). Of the
88 patterns that include at least three lineages (the definition of a
COG), 36 were actually found. Missing were mostly patterns with only
one of the two species of Mycoplasma, which was predictable because
the gene complement of M. genitalium is essentially a subset of the
M. pneumoniae complement (22). The remaining eight patterns that were
never observed all include pathogenic bacteria without E. coli, which
is the largest and most diverse of the available bacterial genomes.
The two most abundant patterns could easily be predicted: all species
("ehgpcmy"), and all species except for the mycoplasmas ("eh__cmy").
What appears much less trivial is that these patterns together
encompass only one-third of all COGs. This fact emphasizes the
remarkable fluidity of genomes in evolution, revealed in spite of the
fact that the analysis concentrated on ancient conserved families.
Multiple solutions for the same important cellular function appear to
be a rule rather than an exception, at least when phylogenetically
distant species are considered (10, 23). On the other hand, the eight
most frequent patterns, which together account for 85% of the COGs,
all include both E. coli and Synechocystis, emphasizing the congruency
between these genomes.
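
The count of 88 possible patterns can be checked with a short
enumeration. The sketch below is illustrative only: it assumes the
seven letter codes stand for E. coli (e), H. influenzae (h),
M. genitalium (g), M. pneumoniae (p), Synechocystis (c), M. jannaschii
(m), and yeast (y), and that these species fall into five lineages as
listed in the LINEAGE table. Both the grouping and the function names
are assumptions made for this example, not something spelled out in
the text above.

```python
from itertools import product

# Species letter codes in the order used by the pattern strings.
SPECIES = "ehgpcmy"

# Assumed grouping of the seven species into five major lineages.
LINEAGE = {
    "e": "Gram-negative", "h": "Gram-negative",
    "g": "Gram-positive", "p": "Gram-positive",
    "c": "Cyanobacteria", "m": "Archaea", "y": "Eukarya",
}

def pattern(present):
    """Render a set of species codes as a 7-position pattern string."""
    return "".join(s if s in present else "_" for s in SPECIES)

def lineages(pat):
    """Return the set of lineages represented in a pattern string."""
    return {LINEAGE[ch] for ch in pat if ch != "_"}

def valid_patterns(min_lineages=3):
    """List every presence/absence pattern spanning enough lineages."""
    result = []
    for bits in product([False, True], repeat=len(SPECIES)):
        present = {s for s, b in zip(SPECIES, bits) if b}
        pat = pattern(present)
        if len(lineages(pat)) >= min_lineages:
            result.append(pat)
    return result

print(len(valid_patterns()))      # 88
print(pattern(set("ehcmy")))      # eh__cmy
```

Under these assumptions the enumeration reproduces the figure of 88,
which suggests the five-lineage grouping is the one intended.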



The 114 ubiquitous COGs, most of them including components of the
translation and transcription machinery, form the universal [...]
twofold down from the bacterial "minimal set" consisting of 256 genes
(23), but significantly [...] seems unlikely, given the broad spectrum
of compared genomes, [...] distribution of the [...] only 45% of the
COGs including representatives of Bacteria, Archaea, and Eukarya, is
another manifestation of the dynamics of gene families in evolution
(Fig. 3). The [...] become even [...] of three-domain COGs will
probably drop, once archaeal-only, eukaryotic-only, and
archaeal-and-eukaryotic COGs emerge with the accumulation of genome
sequences.

The unusual, rare patterns are of particular [...] unexpected
findings. Each of the COGs with patterns that occur only once in our
current collection (Table 1) should correspond to a unique function
scattered over disconnected branches of the tree of life. Why such
functions are conserved and are [...] but not other lineages is a
challenge to be addressed experimentally. The principal evolutionary
mechanisms that can be invoked to explain the emergence of these rare
patterns are differential gene loss and horizontal transfer of genes.
Some of the functions involved, for example, lipoate [...] tRNA
synthetase, appear to be strictly [...] performed by two distinct sets
of orthologs unrelated to one another (24). Other functions, for
example, thymidine phosphorylase and hexuronate dehydrogenases, may be
dispensable under most conditions, and [...] is likely; it is
remarkable, however, that these functions are preserved in the nearly
minimal gene complements of the mycoplasmas. Two of the unique
patterns, namely "__gpc_y" and "_hgp__y", might have evolved through
horizontal transfer of typical eukaryotic genes into bacterial
genomes. The latter pattern is of particular interest as it involves
the choline kinase gene common to a number of bacterial pathogens and
implicated in pathogenicity (25). Two of the COGs with unique
patterns, "_h__c_y" and "e_gp_my", include highly conserved but
uncharacterized proteins whose functions could be predicted only by
detailed analysis of conserved protein motifs (Table 1). These
examples demonstrate the potential for protein function prediction
inherent in the construction of the COGs themselves.

[...] is small and biased, and when a more [...] become available, the
distribution of COGs by phylogenetic patterns is likely to change
significantly; for example, many patterns that are currently rare may
become common when larger genomes from the Gram-positive bacterial
lineage (such as Bacillus subtilis) become available. Nevertheless, we
believe that the language of phylogenetic patterns will become even
more useful for the description of relationships between multiple
genomes.

Connecting and 
Expanding the COGs 



Fig. 3. Phylogenetic patterns in COGs. Letter codes [...] absence of
the respective species [...]. (Panels list each pattern with its
number of COGs, including a "Bacteria only" panel.)

Ancient families of paralogs that span a broad range of taxa are well
known (26). Accordingly, a number of COGs are related [...]
superfamilies. In order to elucidate the superfamily structure of the
COG collection, we used the recently developed PSI-BLAST
(position-specific iterative BLAST) program, which combines BLAST
search with profile analysis (27). Two COGs were considered connected
if at least two of the proteins from the first COG hit members of the
second COG in the PSI-BLAST search [...] produced 58 superfamilies
including 280 [...]. Superfamilies are a higher level of protein
classification. Typically, they include conserved motifs that are
determinants of a distinct biochemical activity, which, however, may
be required for a variety of cellular functions. For example, the
largest superfamily contains 53 COGs with 863 proteins, all of which
contain conserved motifs typical of ATPases and GTPases but are
involved in a broad range of processes from DNA replication to
metabolite transport.
Superfamilies and their signature motifs will be useful in classifying
proteins that have evolved to an extent that they cannot be assigned
to any COG but still retain a conserved motif. We sought to detect
such proteins with distant, subtle similarity to COGs that might be
encoded in the analyzed genomes. The PSI-BLAST analysis (27) detected
"tails" of distantly related proteins (a total of 3686) for 321 COGs,
increasing the total number of proteins connected to COGs to 10,331
(58% of the entire protein set from complete genomes).

Because apparent orthologs from at least three major clades were
required to form a COG, there are potential new COGs hidden among the
results of the comparison of [...] sequences from complete genomes
(11). Clustering by sequence similarity the proteins not included in
COGs (14) resulted in 443 groups with members from two clades.
Predictably, the greatest number, 204, were from the cyanobacterial
and Gram-negative clades, followed by 67 groups combining yeast and
M. jannaschii. Many of these groups are likely to become COGs once
additional genomes are included in the analysis.



Prediction of Protein Functions 
with the COG System 



The COG system allows automatic functional and phylogenetic annotation
of genes and gene sets (29). As in the procedure used for the
construction of the COGs, the criterion for adding likely orthologs
from other genomes to the COGs is based on the consistency between the
observed relationships. A protein is compared to the database of
protein sequences from complete genomes (11) and is included in a COG
if at least two BeTs fall into it. Given that the COGs were
constructed from proteins encoded in complete genomes, it is not a
requirement that newly included proteins also originate from a
complete genome. Indeed, while the unsequenced portion of a genome may
encode proteins with the highest similarity to those included in COGs,
the BeTs will not change for the products of already sequenced genes.
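
The "at least two BeTs" rule can be sketched in a few lines. The code
below assumes that a BeT is the query's best-scoring hit within one
genome and that similarity scores are supplied as
(genome, protein, score) triples; the function names and data layout
are hypothetical and chosen only to illustrate the criterion.

```python
from collections import Counter

def best_hits(scores):
    """Keep, for each genome, the query's best-scoring hit (its BeT).

    scores: iterable of (genome, protein, score) for one query."""
    best = {}
    for genome, protein, score in scores:
        if genome not in best or score > best[genome][1]:
            best[genome] = (protein, score)
    return {g: p for g, (p, _) in best.items()}

def assign_cog(scores, membership, min_bets=2):
    """Return the COG that receives at least min_bets BeTs, else None."""
    bets = best_hits(scores)
    counts = Counter(membership[p] for p in bets.values()
                     if p in membership)
    if not counts:
        return None
    cog, n = counts.most_common(1)[0]
    return cog if n >= min_bets else None

# Hypothetical example: the E. coli and yeast BeTs agree on COG1.
membership = {"e1": "COG1", "y1": "COG1", "m1": "COG2", "e2": "COG2"}
scores = [("ecoli", "e1", 200), ("ecoli", "e2", 150),
          ("yeast", "y1", 90), ("mjan", "m1", 40)]
print(assign_cog(scores, membership))   # COG1
```

Because only the best hit per genome counts, a second strong hit in
the same genome (here "e2") does not add support for another COG.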

As a demonstration of the principle coupled with additional
characterization of the COGs themselves, the sequences of proteins
with known three-dimensional structures from the PDB database (30)
were compared to the protein sequences encoded in complete genomes.
The "two BeT" procedure resulted in proteins with known
three-dimensional structure being included in 183 COGs, of which one
was shown to be a false positive by subsequent alignment analysis.
Thus, structural information could be inferred for at least 25% of the
COGs. In most cases, the structurally characterized protein (from
E. coli or yeast) actually belongs to a COG or is a closely related
homolog of the proteins forming a COG.

Some of the predictions, however, provide significant functional and
structural inferences. Of particular interest are (i) the possibility
of modeling the nuclease domain of polyadenylate cleavage factors (31)
with the beta-lactamase structure, (ii) the presence of an
acylphosphatase domain in hydrogenase expression factors, which form a
highly conserved COG, and in a number of uncharacterized proteins, and
(iii) the connection between a unique carbonic anhydrase and an
acetyltransferase family (Table 2).

Probably the most important application of the COGs is functional
characterization of newly sequenced genomes. In the preliminary
analysis of the recently published genome of the major human bacterial
pathogen Helicobacter pylori (32), 813 proteins (51% of the gene
products) from this bacterium were included in 453 preexisting COGs
and 143 new COGs (33). In spite of the fact that many H. pylori
proteins are highly similar to homologs from E. coli and other
bacteria and have been explored in detail (32), this analysis produced
over 100 additional functional predictions (33).

Table 1. The pattern designations are as in Fig. 3; each COG ID
includes a letter indicating the functional category.

COG0213F*
  Proteins: DeoA-MG051-MP090-MJ0667
  Activity or function: Thymidine phosphorylase; salvage of
    deoxypyrimidines
  Comment: Nonessential gene in E. coli; apparent orthologs found in
    other Gram-positive bacteria and in humans (35).

e__p__y, COG0246G
  Proteins: MtlD, UxaB, UxuB, YdfI, YefQ-MP190-YEL070W, YNR073C
  Activity or function: Mannitol-1-phosphate and other hexuronate
    dehydrogenases; hexuronate catabolism
  Comment: Nonessential genes in E. coli; accessory reactions of
    carbohydrate metabolism (36).

COG0095H
  Proteins: LplA-MG270-MP450-(sll0809)-YJL046w
  Activity or function: Lipoate-protein ligase A; ligation of lipoate
    to apoproteins of pyruvate dehydrogenase and other
    lipoate-dependent enzymes
  Comment: There are two unrelated classes of lipoate-protein ligases;
    E. coli and yeast encode both forms; H. influenzae and
    Synechocystis sp. encode the B form (included in a separate COG);
    sll0809 is a distant homolog of the A [...], which was not
    automatically included in the COG but was detected with PSI-BLAST.

eh_pc_y, COG0604R
  Proteins: AdhC + 18 E. coli proteins-MP278-sll0990, slr1192-YBR046c
    + 19 yeast proteins
  Activity or function: Alcohol dehydrogenase class III and related
    Fe-S dehydrogenases; various catabolic pathways
  Comment: Highly conserved protein family distinct from other Fe-S
    oxidoreductases.

COG0678R
  Proteins: HIN1693_1-sll1621-YLR109W
  Activity or function: Glutaredoxin-like membrane protein
    (prediction)
  Comment: The H. influenzae protein contains an additional
    thioredoxin-like domain.

__gpc_y, COG0631R
  Proteins: MG108-MP586-sll1771-slhO33-sll0602-YDL006w + 6 yeast
    proteins
  Activity or function: Protein serine and threonine phosphatase
  Comment: Serine and threonine protein phosphatases are abundant in
    eukaryotes but not in bacteria (33).

COG0423J
  Proteins: MG251-MP483-MJ0228-YPR081c, YBR121C
  Activity or function: Glycyl-tRNA synthetase (eukaryotic and
    Gram-positive type)
  Comment: Gram-negative bacteria and Synechocystis encode a distinct
    glycyl-tRNA synthetase that appears to be unrelated to the [...]
    of this COG in E. coli and H. influenzae is prolyl-tRNA synthetase
    (24).

e_gp_my, COG0822R
  Proteins: D230D-MG207, MP029-MJ0823, MJ0938-YHR012W
  Activity or function: Phosphoesterase (prediction)
  Comment: Highly conserved protein family that shares only modified
    catalytic motifs (detected by PSI-BLAST; P - 0 D04) with other
    phosphoesterases, including protein phosphatases.

eh_pcmy, COG0078E
  Proteins: ArgI, ArgF, YgeW-HIN0012-MP531-sll0902-MJ0881-YJL088w
  Activity or function: Ornithine carbamoyltransferase; arginine
    biosynthesis
  Comment: Amino acid metabolism appears to be completely missing in
    M. genitalium, but residual reactions may occur in M. pneumoniae.

_hgp__y, COG0510M
  Proteins: HIN0938-MG356, MP310-YDR147W, YLR133W
  Activity or function: Choline kinase (prediction) involved in
    lipopolysaccharide biosynthesis
  Comment: Enzyme common to several bacterial pathogens and
    eukaryotes; contributes to pathogenicity (25).

*This COG was added to the collection by cluster analysis.

Conclusions and Perspective 

The COGs bring together the fields of comparative genomics and protein
classification. Among the numerous possible approaches to protein
classification, the COGs appear to be unique as a prototype of a
natural system, which has as its basic unit a group of descendants of
a single ancestral gene. Typically, such a group is associated with a
conserved, specific function, so that the inclusion of a protein in a
COG automatically entails functional prediction.

Each COG contains conserved genes from at least three phylogenetically
distant clades and, accordingly, corresponds to an ancient conserved
region (ACR). Previous analyses have indicated that the total number
of distinct ACRs is likely to be less than 1000 (34). Thus, even with
the limited number of complete genomes currently available for
analysis, the COGs have already captured a substantial fraction of all
existing highly conserved protein domains. With more genomes included
in the system, the discovery of additional COGs should gradually level
off, with the great majority of the ACRs encoded in the added genomes
fitting into already known COGs.

With the forthcoming flood of genome sequences, a coherent framework
for understanding these genomes from both the functional and
evolutionary viewpoints is a must. We regard the current collection of
COGs as a crude first version of such a framework. Inclusion of
additional, phylogenetically diverse genomes and further development
of the procedures used to derive and analyze COGs will hopefully
result in refinement of this system, making it a solid platform for
genome annotation and evolutionary genomics.

Table 2. Structural and functional predictions for uncharacterized
proteins in COGs.

e_gpcmy, COG0595R*
  Proteins in COG†: PhnP, [...]-2g-2p-5c-[...]-YLR277C, YMR137C,
    YKR079C
  Activity and function: Predicted Zn-dependent hydrolases
  Homolog in PDB‡: Beta-lactamase (1BMC); BeTs detected: 2; lowest P
    with a COG member: 0.039
  Comment: Activity is not known for any protein in this ubiquitous
    COG. Biochemical and genetic data indicate that YLR277C is
    involved in messenger RNA 3'-end processing (37), whereas YMR137c
    is DNA cross-link repair protein SNM1 (39). A motif including the
    Zn-coordinating histidines of beta-lactamase is conserved.

eh__cmy, COG0607R
  Proteins in COG: SseA, PspE, GlpE, YibN, YbbB, YnjE,
    YgaP-2h-5c-MJ0052-4y
  Activity and function: Predicted sulfurtransferases
  Homolog in PDB: Rhodanese (1RHD, 2ORA, 1ORB); BeTs detected: 2;
    lowest P with a COG member: 10^-41
  Comment: The sulfurtransferase activity of SseA has been
    demonstrated (40), but the rest of the proteins in this COG have
    no known activity. PspE (phage shock protein), GlpE
    (uncharacterized protein involved in glycerol metabolism), and
    other small proteins correspond to one of the two rhodanese
    domains.

ehgpc_y, COG0596R
  Proteins in COG: PldB, MhpC, YcdJ,
    YnbC-HIN0065-MG020-MP132-6c-YNR064C, YKL094W
  Activity and function: Predicted hydrolases and acyltransferases
  Homolog in PDB: Lipases (2LIP, 1TAH B, 1CVL); BeTs detected: 3;
    lowest P with a COG member: 8 x 10^-5
  Comment: PldB is known to possess triglyceride lipase activity (41).
    All other proteins in the COG have not been characterized but now
    can be predicted to possess the α/β-hydrolase fold.

e___cm_, COG0068C
  Proteins in COG: HypF-sll0322-MJ0713
  Activity and function: Hydrogenase maturation factor
  Homolog in PDB: Acylphosphatase (1APS); BeTs detected: 2; lowest P
    with a COG member: 2 x 10^-5
  Comment: HypF is required for hydrogenase biosynthesis (42), but no
    biochemical activity is known. The ~100 amino acid, NH2-terminal
    domain aligns with acylphosphatase, with the catalytic residues
    conserved, suggesting that HypF orthologs indeed possess
    acylphosphatase activity. A PSI-BLAST search with this domain as
    the query detected five additional likely acylphosphatases, namely
    E. coli YccX and M. jannaschii MJ0809, MJ0553, MJ1331, and MJ1405
    (43).

e___cm_, COG0663R
  Proteins in COG: CaiE, YrdA, Ydb2-sll1636, sll1031-MJ0304
  Activity and function: Predicted carbonic anhydrases
  Homolog in PDB: Carbonic anhydrase from Methanosarcina thermophila
    (1THJ); BeTs detected: 3
  Comment: The biochemical activity of the proteins in this COG is not
    known. They show not only conservation of histidine residues
    comprising the active center of this unusual carbonic anhydrase
    (44) but also significant similarity to acetyltransferases of the
    isoleucine patch superfamily (45), suggesting an unexpected
    connection between the two types of enzymes.

*The designations are as in Table 1 and Fig. 3. †2g indicates two
proteins from M. genitalium, 2p indicates two proteins from
M. pneumoniae, and so forth. ‡The PDB accession is indicated in
parentheses.

Microbial pathogen genomes -
new strategies for identifying
therapeutics and vaccine targets

Advances in high-throughput DNA-sequencing techniques have given us
the unprecedented ability to rapidly determine the nucleotide
sequences of entire bacterial genomes. The application of these
methods to the genomes of microbial pathogens, combined with efficient
analytical tools and genome-scale approaches for studying gene
expression, is revolutionizing our approach to the selection of
targets for drug screening and vaccine development. This is bringing
new life to this important, but long-neglected, field of research.



The decision, several years ago, by the US Department of Energy, the
National Institutes of Health (NIH) and several international funding
agencies to embark upon programs to map and sequence the human genome
has led to a number of important technological advances that are
beginning to have an impact in other areas of biology. Among these
advances are the development of automated methods for the generation
of large amounts of raw DNA-sequencing information, computer software
for rapidly processing and analyzing primary sequence data, and
techniques for the rapid assembly of shotgun sequencing reads, even
from entire bacterial genomes. Efficient algorithms for similarity
searching allow the rapid identification of protein-encoding sequences
that are homologous to other genes, the sequences of which are held in
public and private databases; as of April 1996, approximately 500
megabases (Mb) of nucleotide sequence were contained in GenBank, and
approximately 200 000 sequences were held in the
SWISS-PROT/Genpept/PIR database of non-redundant proteins. Combined
with the wealth of biochemical information that is archived in public
databases, it has become possible to describe rapidly the full
repertoire of genes in a microbial genome, and to predict many of the
metabolic pathways that an organism may utilize.

Progress in this field has been stimulated by the interests of the
biotechnology and pharmaceutical industries in using genome-sequencing
data as a basis for drug discovery. In turn, this has led to the
development of proprietary databases containing genomic information,
which provide the basis for in silico experiments to identify novel
targets for drugs, and for laboratory experiments to identify genes
that perform critical functions. This article summarizes some recent
developments in this important area, focusing on bacterial sequences,
and provides examples to illustrate how genome-sequencing information
from microbial pathogens can be used to select targets for vaccine and
drug development. The overall process used to proceed from sequence
generation to target validation is illustrated in Fig. 1.

D. R. Smith (smith@mucom) is at Genome Therapeutics Corporation,
100 Beaver Street, Waltham, MA 02154, USA.

Large-scale sequencing of bacterial genomes 

Many laboratories use automated sample-preparation techniques and
fluorescence-based gel readers [such as that produced by Applied
Biosystems Inc. (ABI); Foster City, CA, USA] for the large-scale
sequencing of bacterial genomes. These instruments have the advantage
that they are efficient, and relatively easy to set up and operate. A
few laboratories use computer-assisted multiplex sequencing to achieve
the same end (Ref. 1). In multiplex sequencing, samples consisting of
pools of up to 20 plasmids are processed through sample preparation
and gel electrophoresis, and the resulting sequences are determined
from electroblots of the gels by hybridization with radioactive or
fluorescently labeled probes. This technique can be used to generate
40 films (or digitised images) from each sequencing gel. Although
multiplex sequencing is efficient at producing large amounts of
'shotgun' data, it is more difficult to set up and operate in the
laboratory than is fluorescence-based gel sequencing, and it is not
suited to directed-finishing strategies. ABI machines are used in the
author's laboratory to generate primer-directed reads for finishing
and gap closure.

During the past year, a group at The Institute for Genomic Research
(TIGR; Gaithersburg, MD, USA) reported the complete sequences of
Haemophilus influenzae (1.8 Mb), a major cause of respiratory
infections and meningitis, especially in children (Ref. 2), and of
Mycoplasma genitalium (0.6 Mb), which causes urethritis (Ref. 3).
Approximately 1.6 Mb of contiguous sequence from the 4.7 Mb
Escherichia coli genome has been published (Ref. 4), and the
sequencing of a further 2 Mb was reported at the 1995 Genome
Sequencing and Analysis VII (GSA-VII) meeting (Ref. 5). The genome of
Helicobacter pylori (1.7 Mb), the major cause of stomach ulcers, has
been sequenced by Genome Therapeutics Corporation (GTC; Waltham, MA,
USA) under a privately funded microbial-pathogen sequencing program.
More than half (1.5 Mb) of the 2.8 Mb genome of Mycobacterium leprae
(the etiologic agent of leprosy) has also been sequenced by GTC and is
available through GenBank, the GTC web site (Ref. 6)
<http://www.cric.com>, and through MycDB
<http://www.biochem.kth.se/MycDB.html>, which [...] mycobacterial
genome mapping and sequence [...].

Other microbial pathogens that are currently being sequenced include
Neisseria gonorrhoeae (University of Oklahoma, Norman, OK, USA),
Streptococcus pyogenes (University of Oklahoma), Treponema pallidum
(University of Texas, Houston, TX, USA, and TIGR), Mycobacterium
tuberculosis (GTC and the Sanger Centre, Hinxton, Cambridge, UK), and
Staphylococcus aureus [GTC, and Human Genome Sciences (HGS; Rockville,
MD, USA)].

In addition to these pathogens, the genomes of several archaebacteria
and other non-pathogens are being sequenced. These include
Methanococcus jannaschii (TIGR), Pyrococcus furiosus (University of
Utah, Salt Lake City, UT, USA), Sulfolobus solfataricus (Dalhousie
University, Halifax, Nova Scotia, Canada), and Pyrobaculum aerophilum
(California Institute of Technology, Pasadena, CA, USA, and University
of California, Los Angeles, CA, USA). The 1.7 Mb genome of the
archaeon Methanobacterium thermoautotrophicum is near completion at
GTC (Ref. 7). Approximately 2 Mb of the 4.1 Mb Bacillus subtilis
genome has now been sequenced by a consortium of European and Japanese
laboratories, and the project may be completed by the end of 1996
(Ref. 8). Approximately 1 Mb of genomic sequence from the [...] genome
of the cyanobacterium Synechocystis sp. 6803 was recently published
(Ref. 9).

Within the next couple of years, therefore, we can expect an explosion
of bacterial-genome sequence information from species representing a
variety of phylogenetic lineages, including many pathogens.

Pharmaceutical companies have shown considerable interest in using
pathogen genomics to facilitate the development of vaccines and
small-molecule therapeutics. For example, researchers at Glaxo
Wellcome have sequenced a substantial fraction of the H. pylori genome
to assist in the process of drug discovery. Over the past year, GTC
has formed two research alliances with pharmaceutical companies to
take advantage of sequences from microbial pathogens: one with Astra
AB, focusing on the development of new antibiotics and vaccines to
treat H. pylori infection, and one with Schering-Plough (Union, NJ,
USA), to develop broad-spectrum antibiotics and vaccines. Although the
genomic route to drug discovery for bacterial pathogens is new and
remains unproved, the basic paradigm (outlined below) of gene
identification, followed by functional analysis and drug screening, is
well established. Thus, it is likely that more companies will become
involved, and that in the near future, additional research alliances
between genomics companies and the pharmaceutical industry will
materialize in this area.

Figure 1
Flow diagram illustrating the process by which a microbial genome
sequence is analysed and the information is used to direct experiments
and aid in target selection [...] development. The individual steps
are referred to throughout the text. In the case of vaccine
candidates, gene products from selected targets are expressed and
tested in animal models.



From sequence to genes 

The first task, when confronted with an entire bacterial-genome
sequence, is to identify all the genes. This can be accomplished using
a variety of techniques, but the most successful approaches use a
combination of reading-frame and codon-usage analysis, together with
similarity searching, to identify putative genes with homology to
previously described sequences. Commonly used tools include GeneMark
(Ref. 10), GenomeBrowser (Ref. 11), BLAST (Ref. 12), and highly
parallelized implementations of the Smith-Waterman alignment, such as
BLAZE or MPsrch (Ref. 13). In general, organism-specific codon usage
is highly predictive for bacterial genes, but its effective use
depends on the existence of sufficient information to generate
accurate codon-usage matrices. In some cases, subsets of genes within
an organism will exhibit codon-usage patterns that deviate
significantly from the norm (Ref. 14). Such genes are thought to
represent evolutionarily recent acquisitions by phage transduction,
conjugation, or some other form of horizontal transfer from other
organisms. If enough of these genes are present, codon-usage tables of
genomic subsets can be constructed to identify them. Translational
start sites can be identified by the occurrence of start codons that
coincide with abrupt changes in codon usage, the initiation of
homology to previously characterized genes, or the presence of
Shine-Dalgarno sequences (Ref. 15). Automated analysis tools (such as
GenomeBrowser; Ref. 11) that provide a graphical display of open
reading frames (ORFs), codon usage, database homologies and other
features make the task of identifying bacterial genes and their
relationships with each other in the genome relatively
straightforward. With the increasing pace of bacterial-genome
sequencing, there is an emerging need for second-generation tools that
will automate most of the laborious annotation process.

[Cartoon caption: 'Well, according to the algorithms, it folds like
this!']
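
To make the codon-usage idea concrete, the sketch below builds a
codon-frequency table from a set of reference genes and scores a
candidate gene by its average log-probability per codon; genes that
score well below the typical range would be candidates for atypical
codon usage. This is a deliberately simplified stand-in written for
illustration. It is not the method used by GeneMark or any other tool
named in the text, and the function names, pseudocount, and toy
sequences are assumptions.

```python
import math
from collections import Counter

BASES = "ACGT"
ALL_CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]

def codons(seq):
    """Split an in-frame coding sequence into codons, dropping any
    trailing partial codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def codon_frequencies(genes, pseudocount=1.0):
    """Estimate codon frequencies from reference genes, adding a
    pseudocount so unseen codons keep a nonzero probability."""
    counts = Counter(c for g in genes for c in codons(g))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * 64
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}

def mean_log_likelihood(gene, freqs):
    """Average log-probability per codon of a gene under the table."""
    cs = [c for c in codons(gene) if c in freqs]
    if not cs:
        raise ValueError("sequence contains no scorable codons")
    return sum(math.log(freqs[c]) for c in cs) / len(cs)

# Toy reference set dominated by the codon GCT (hypothetical data).
reference = ["ATGGCTGCTGCTGCTTAA", "ATGGCTGCTAAAGCTTAA"]
table = codon_frequencies(reference)
typical = mean_log_likelihood("GCTGCTGCTGCT", table)
atypical = mean_log_likelihood("CGGCGGCGGCGG", table)
print(typical > atypical)   # True
```

In practice the reference set would need to be large enough to give
stable estimates for all 64 codons, which is the point the text makes
about sufficient information for accurate codon-usage matrices.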

From genes to function 

The second phase in the analysis of bacterial genomes is to identify
the function of as many genes as possible. Currently, sequence
homology is the most powerful tool. A high degree of homology between
the putative translation product of a newly identified gene and an
enzyme whose function has been thoroughly studied in other organisms
provides strong support for the function of that protein, especially
if it is the only homolog in the genome under scrutiny. Other useful
tools include programs that identify sequence motifs from databases
such as PROSITE (Ref. 16), BLOCKS (Ref. 17), BEAUTY (Ref. 18) and
ProDom (Ref. 19). If one [...] identify vaccine candidates, then [...]
highly expressed cell-surface proteins is relevant, so it is useful to
know whether a protein contains a secretion signal, even if nothing
else is known about it. Although the tools described here are very
good at identifying homologies, 25-40% of the genes in a bacterial
genome typically fail to show significant similarity with known
proteins.

Once the set of similarity-searching tools has been
exhausted, one must return to molecular biology to
further elucidate the function and expression pattern
of predicted genes. Commonly used approaches to
identifying essential genes in an organism include the
use of gene knockouts, disruptions using transposon-mediated
mutagenesis, or homologous recombination
with disrupted gene constructs that contain an
antibiotic-resistance cassette. Gene disruptions can be
generated in a variety of ways, including sophisticated
hit-and-run approaches that interrupt a gene without
introducing polar effects into downstream ORFs
(Ref. 20). However, a gene-by-gene approach to the
study of a whole genome is certainly time consuming
and labor intensive.

The availability of large amounts of genome-sequence
information has stimulated the development
of new approaches to functional analysis on a genomic
scale. This has been particularly true for researchers
investigating yeast, where a concerted effort is being
made to ascertain the function of every ORF in the
genome. Such strategies include the conceptually
simple, but technologically advanced, technique of
making microarrays of polymerase chain reaction
(PCR)-amplified gene sequences on glass slides to
allow the fluorescence-based detection of quantitative
hybridization signals from labeled cDNA probes on
large numbers of genes simultaneously, perhaps even
all the genes of an organism. An ingenious PCR-based
approach to efficient sequence-signature-based
expression analysis has recently been demonstrated.
For example, a technique termed 'genetic fingerprinting'
promises to replace individual gene knockouts
by a global transposon-mutagenesis approach (Ref. 23).
Insertions are induced en masse in a population,
the strain is grown under a variety of conditions,
and PCR products are analysed to identify genes in
which transposon hops are under-represented because
the genes are required for growth. A conceptually
similar dropout technique, which uses tagged transposons
to identify the Salmonella typhimurium genes

S# vm * a,ce » a moase 

Techniques that probe subsets of genes for a specific
functionality, such as secretion or induction during
growth in the host, have also been described. These
techniques provide clones from which signature
sequences can be derived, so that corresponding
genes can be identified by comparing them with
the genomic sequence. The IVET (in vivo expression
technology) technique, which detects gene fusions
that result in the in vivo selectable expression of a
defective purA gene or antibiotic-resistance gene,
has been used to identify Salmonella genes, the expression
of which is induced when the pathogen is grown
in mice (Ref. 25). Finally, protein microsequencing (Ref. 26) and
mass-spectrometry-based peptide analysis (Ref. 27) have
been used to identify protein components (e.g. outer-membrane
proteins) in partially purified mixtures,
or to identify specific proteins separated by two-dimensional
gel electrophoresis. Sequences generated
in this manner can be used to correlate specific proteins
with the gene sequences from which they are
expressed.



Target selection and validation 

The techniques described in the previous section
can be used to identify genes in specific functional
categories that may represent good targets for drug or
vaccine development. In general, when developing
new antibiotics, one is interested in genes that are
essential under all growth conditions (and preferably
even in quiescent cells), and for which inhibitors with
useful chemical properties, such as permeability and
low toxicity, can be identified. One advantage of
having the entire sequence of a genome is that targets
can be prioritized in terms of their activities and the
properties of compounds that are known to interact
with them. Even with the results of knockout or in
vivo expression experiments, additional biological
information can aid in narrowing down the field of
choices. For example, genes can be selected on the
basis of their probable roles in intracellular metabolism.
Databases, such as EcoCyc (Ref. 28) or PUMA
(Ref. 29), that describe known metabolic pathways
can be helpful in this regard. Detailed structural information
about homologs of identified genes (determined
using the Protein DataBank, Ref. 30) can be used to
assist in the molecular modeling of inhibitors (some
resources for molecular modeling can be found at
Ref. 31).

As more genomes are sequenced, it will become
possible to identify genes that are unique to a particular
organism or group of organisms, or genes
that are conserved in certain groups. Thus, for
example, it will be possible to use electronic comparison
to identify genes that are present in H. pylori
but not in other gut-dwelling bacteria such as E. coli,
providing a basis for the development of antibiotics
specific to H. pylori. Although combinatorial chemistries
promise to speed up our ability to synthesize
and screen large numbers of unique chemical entities,
the sequence-based approach described here provides
an avenue for the rational identification and selection
of key targets for therapeutics development. Ultimate
validation of the targets will, of course, require
additional experiments such as protein expression,
biochemical-assay development and animal
studies to identify those with the most useful
properties or inhibitors.
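
The electronic comparison described above amounts to a set difference over homology results. The following Python sketch assumes that a prior similarity search has already produced, for each comparison genome, the set of target-organism genes with a significant homolog there. The function name and gene identifiers are hypothetical.

```python
def organism_specific_genes(target_genes, homolog_hits):
    """Return target genes that have no homolog in any comparison genome.

    target_genes: iterable of gene identifiers in the organism of interest.
    homolog_hits: dict mapping a comparison-genome name to the set of target
                  gene identifiers with a significant homolog in that genome.
    """
    shared = set().union(*homolog_hits.values()) if homolog_hits else set()
    return sorted(set(target_genes) - shared)


# Hypothetical identifiers, for illustration only.
h_pylori_genes = ["geneA", "geneB", "geneC", "geneD"]
hits = {"E. coli": {"geneA", "geneC"}}
specific = organism_specific_genes(h_pylori_genes, hits)
print(specific)
```

The quality of such a list depends entirely on the similarity threshold used to call a homolog, which this sketch leaves to the upstream search.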
The structure-function relationship within the DNA
binding site of the Escherichia coli replicative helicase
DnaB protein was studied using nuclease digestion,
quantitative fluorescence titration, centrifugation, and
fluorescence energy transfer techniques. Nuclease digestion
of the enzyme-single-stranded DNA (ssDNA)
complexes reveals large structural heterogeneity within
the binding site. The total site is built of two subsites
differing in structure and affinity, although both occlude
~10 nucleotides. ssDNA affinity for the strong subsite
is ~3 orders of magnitude higher than that for the
weak subsite.

Fluorescence energy transfer experiments provide di- 
rect proof that the DnaB hexamer binds ssDNA in a 
single orientation, with respect to the polarity of the 
sugar-phosphate backbone. This is the first evidence of 
directional binding to ssDNA of a hexameric helicase in 
solution. The strong binding subsite is close to the small 
12-kDa domains of the DnaB hexamer and occludes the 
5'-end of the ssDNA. The strict orientation of the helicase
on ssDNA indicates that, when the enzyme ap-
proaches the replication fork, it faces double-stranded 
DNA with its weak subsite. The data indicate that the 
different binding subsites are located sequentially, with 
the weak binding subsite constituting the entry site for 
double-stranded DNA of the replication fork. 



The DnaB protein is an essential replication protein in Esch- 
erichia coli (1) which is involved in both the initiation and 
elongation stages of DNA replication (2-4). The protein is the 
E. coli primary replicative helicase, i.e. the factor responsible 
for unwinding the duplex DNA in front of the replication fork 
(5, 6). The DnaB protein is the only helicase required to recon- 
stitute DNA replication in vitro from the chromosomal origin of 
replication. In the complex with ssDNA, 1 the DnaB protein 
forms a "mobile replication promoter." This nucleoprotein com- 
plex is specifically recognized by the primase in the initial 
stages of the priming reaction (1). 

In solution, the native DnaB protein exists as a stable hex- 
amer, composed of six identical subunits (7-9). Sedimentation 



* This work was supported in part by National Institutes of Health 
Grant GM-46679 (to W. B.); John Sealy Memorial Endowment Fund 
Grant 2545-95; and NIEHS, National Institutes of Health, Grant 
5P30ES06676. The costs of publication of this article were defrayed in 
part by the payment of page charges. This article must therefore be 
hereby marked "advertisement" in accordance with 18 U.S.C. Section 
1734 solely to indicate this fact. 

t To whom correspondence should be addressed: Dept. of Human 
Biological Chemistry and Genetics, University of Texas Medical Branch 
at Galveston, 301 University Blvd., Galveston, TX 77555-1053. 

1 The abbreviations used are: ssDNA, single-stranded DNA; dsDNA,
double-stranded DNA; AMP-PNP, β,γ-imidoadenosine-5'-triphosphate;
CPM, 7-diethylamino-3-(4'-maleimidylphenyl)-4-methylcoumarin; Fl,
fluorescein.



equilibrium, sedimentation velocity, and nucleotide cofactor 
binding studies show that the DnaB helicase exists as a stable 
hexamer in a large protein concentration range, specifically 
stabilized by magnesium cations (7, 8). Hydrodynamic and 
electron microscopy data indicate that six protomers aggregate 
with cyclic symmetry in which the protomer-protomer contacts 
are limited to only two neighboring subunits (7, 10, 11). Sedi- 
mentation velocity and electron microscopy studies reveal that 
the DnaB hexamer undergoes dramatic conformational 
changes upon binding AMP-PNP and ssDNA, and provide direct
evidence of the presence of long range allosteric interactions
in the hexamer, encompassing all six subunits of the
enzyme (8, 11). 

Recently, we obtained the first estimate of the stoichiometry 
of the DnaB helicase-ssDNA complex and the mechanism of the 
binding (12-14). Using the quantitative fluorescence titration 
method, we determined that the DnaB helicase binds ssDNA 
with a stoichiometry of 20 ± 3 nucleotides/DnaB hexamer and 
that this stoichiometry is independent of the type of nucleic 
acid base (13). Our thermodynamic studies of binding of ssDNA 
oligomers to the DnaB hexamer show that the enzyme has a 
single, strong binding site for ssDNA (12). The results also 
show that the same binding site is used in the binding to 
oligomers and polymer nucleic acids (12, 13). Moreover, photo- 
cross-linking experiments indicate that the ssDNA binding site 
is located predominately, if not completely, on a single subunit 
of the hexamer (12, 13). 

The reaction catalyzed by a helicase, the unwinding of a 
duplex DNA, must take place in the DNA binding site. The fact 
that the helicase uses the same single DNA binding site, when 
forming a complex with polymer ssDNAs, oligomers, and rep- 
lication fork substrates, indicates a complex structure of the 
nucleic acid binding site that can accommodate both ssDNA 
and dsDNA.

In this communication, we report the analysis of interactions 
between the DnaB helicase and DNA within the total DNA 
binding site of the enzyme. We present direct evidence that the 
total DNA binding site of the helicase is structurally and func- 
tionally heterogeneous. The total binding site is built of two 
subsites, each encompassing approximately 10 nucleotide res- 
idues. We provide direct proof that the DnaB hexamer binds 
ssDNA in a strictly single orientation, with respect to the 
polarity of the sugar-phosphate backbone of the nucleic acid. 
The results indicate that the binding subsites are sequentially 
located along the nucleic acid lattice, with the weak binding 
subsite constituting an entry site for the duplex part of the 
replication fork. 

MATERIALS AND METHODS 

Reagents and Buffers — All solutions were made with distilled and
deionized >18 megaohms (Milli-Q Plus) water. All chemicals were
reagent grade. Buffer T2 is 50 mM Tris adjusted to pH 8.1 with HCl, 5
mM MgCl₂, 10% glycerol. Buffer H is 50 mM Hepes adjusted to pH 8.1
with HCl, 5 mM MgCl₂, 10% glycerol. The temperature, AMP-PNP, and
salt concentrations are indicated in the text. The fluorescent markers,
CPM and fluorescein 5'-isothiocyanate, used in the modification, were
purchased from Molecular Probes (Eugene, OR).

DnaB Protein— The E. coli DnaB protein was purified, as described 
previously by us (7, 15-17). The concentration of the protein was
spectrophotometrically determined, using the extinction coefficient ε₂₈₀ =
1.85 × 10⁵ cm⁻¹ M⁻¹ (hexamer) (7).

Site-directed Mutagenesis of the DnaB Helicase— Replacement of the 
arginine residues at position 14 from the N terminus of the DnaB 
protein and obtaining the DnaB protein variant, R14C, were performed 
using the plasmid RLM1038, harboring the gene of the wild type DnaB 
helicase, generously provided by Dr. R. McMacken. The site-directed 
mutagenesis was accomplished in the NIEHS Center facility (National 
Institutes of Health) directed by Dr. T. Wood. 

Labeling the DnaB R14C Variant with Fluorescent Markers— Labeling
of the 6 cysteine residues of the DnaB variant, R14C hexamer, with
CPM was performed in H buffer (pH 8.1, 100 mM NaCl, 5 mM MgCl₂,
10% glycerol) at 4 °C. The fluorescent label was added from the stock
solution to the molar ratio of the CPM/R14C ~25. The mixture was
incubated for 4 h, with gentle mixing. After incubation, the protein was
precipitated with ammonium sulfate and dialyzed overnight against
buffer T2. Any remaining free dye was removed from the modified
R14C-CPM by applying the sample on a DEAE-cellulose column and
eluting with buffer T2 containing 500 mM NaCl. The degree of labeling
was determined by absorbance of the marker at 394 nm using the
extinction coefficient of CPM, ε₃₉₄ = 27 × 10³ cm⁻¹ M⁻¹, providing the
value of 5.8 ± 0.1 of CPM per DnaB hexamer. 2

Nucleic Acids— All nucleic acids were purchased from Midland Certified
Reagents (Midland, TX). The etheno-derivatives of nucleic acids
were obtained by modification with chloroacetaldehyde (12, 18). Oligomer
dT(pT)₁₉, labeled at the 5'-end with fluorescein, 5'-Fl-dT(pT)₁₉,
was synthesized using fluorescein phosphoramidate (Glen Research).
Labeling of the 3'-end was performed by synthesizing dT(pT)₁₉ with the
last residue at the 3'-end of the oligomer having the amino group on a
six-carbon linker. The amino group was subsequently modified with
fluorescein 5'-isothiocyanate to obtain dT(pT)₁₉-Fl-3'. The degree of
labeling was determined by absorbance at 494 nm (pH 9), using the
extinction coefficient, 7.6 × 10⁴ M⁻¹ cm⁻¹ (13). The same procedures
were used for labeling the 5'- and 3'-ends of the dA(pA)₉. The concentrations
of labeled oligomers were spectrophotometrically determined
at 260 nm (pH 8.1), using extinction coefficients, 1.76 × 10⁵ M⁻¹ cm⁻¹
and 11.4 × 10⁵ M⁻¹ cm⁻¹, respectively (13). The concentrations of
dεA(pεA)₉, dεA(pεA)₈, dεA(pεA)₇, dεA(pεA)₆, dεA(pεA)₅, dεA(pεA)₄, and
dεA(pεA)₃ were determined using extinction coefficients 37 × 10³, 33.3 ×
10³, 29.6 × 10³, 25.9 × 10³, 22.2 × 10³, 18.5 × 10³, and 14.8 × 10³ M⁻¹
cm⁻¹ at 257 nm, respectively (12, 13, 19). Labeling the 5'-ends of ssDNA
oligomers with ³²P was performed using the standard procedure (12).

Sedimentation Velocity Measurements — Analytical sedimentation ex- 
periments were performed using an Optima XL-A analytical ultracen- 
trifuge. Analyses of the sedimentation runs were performed as we 
previously described (8, 9, 13). The reported values of sedimentation 
coefficients were corrected to standard conditions, s_20,w, for solvent
density and viscosity (7). 

Fluorescence Measurements— All steady-state fluorescence measurements
were performed using the SLM-Aminco 48000S and 8100 spectrofluorometers
(20). The emission spectra were corrected for the wavelength
dependence of the instrument response using software
provided by the manufacturer. The binding of the DnaB protein was
followed by monitoring the fluorescence of the etheno-derivatives of
ssDNA oligomers (λ_ex = 325 nm, λ_em = 410 nm). All titration points
were corrected for dilution and, if necessary, for inner filter effect using
the formula (15),

$$F_{i,cor} = (F_i - B_i)\left(\frac{V_i}{V_0}\right)10^{0.5\,b\,A_{\lambda_{ex}}}$$    (Eq. 1)

where F_i,cor is the corrected value of the fluorescence intensity at a given
point of titration i, F_i is the experimentally measured fluorescence
intensity, B_i is the background, V_i is the volume of the sample at a given
titration point, V_0 is the initial volume of the sample, b is the total
length of the optical path in the cuvette expressed in centimeters, and
A_λex is the absorbance of the sample at the excitation wavelength.
Computer fits were performed using KaleidaGraph software (Synergy
Software, PA) and Mathematica (Wolfram Research, IL). The relative
fluorescence increase of the nucleic acid, ΔF, upon binding the DnaB
protein is defined by the equation,

$$\Delta F = \frac{F_{i,cor} - F_0}{F_0}$$    (Eq. 2)

where F_i,cor is defined by Equation 1, and F_0 is the initial value of the
fluorescence of the same solution.
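
As an illustration, the dilution and inner-filter correction and the relative fluorescence increase can be sketched in Python. The sketch assumes the standard forms F_cor = (F − B)(V/V_0)·10^(0.5·b·A) for Equation 1 and ΔF = (F_cor − F_0)/F_0 for Equation 2. The function names and numerical values are hypothetical.

```python
def corrected_fluorescence(f_meas, background, volume, initial_volume,
                           absorbance_ex, path_cm=1.0):
    """Correct a titration point for background, dilution and inner filter effect.

    Assumed form of Eq. 1: (F - B) * (V / V0) * 10**(0.5 * b * A_ex).
    """
    dilution = volume / initial_volume
    inner_filter = 10 ** (0.5 * path_cm * absorbance_ex)
    return (f_meas - background) * dilution * inner_filter


def relative_increase(f_cor, f_initial):
    """Relative fluorescence increase, assumed form of Eq. 2: (F_cor - F0) / F0."""
    return (f_cor - f_initial) / f_initial


# Hypothetical titration point: 10% dilution, negligible absorbance.
f_cor = corrected_fluorescence(105.0, 5.0, 1.1, 1.0, absorbance_ex=0.0)
delta_f = relative_increase(f_cor, f_initial=100.0)
print(f_cor, delta_f)
```

With a non-zero absorbance at the excitation wavelength the same call applies the additional factor 10^(0.5·b·A).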

2 S. Rajendran, M. J. Jezewska, and W. Bujalowski, manuscript in
preparation.

All steady-state fluorescence anisotropy measurements were performed
in the L format, using Glan-Thompson polarizers placed in the
excitation and emission channels. The fluorescence anisotropy, r, of the
sample was calculated by the equation,

$$r = \frac{I_{VV} - G\,I_{VH}}{I_{VV} + 2G\,I_{VH}}$$    (Eq. 3)

where I is the fluorescence intensity, and the first and second subscripts
refer to vertical (V) polarization of the excitation and vertical (V) or
horizontal (H) polarization of the emitted light (16). The factor G =
I_HV/I_HH corrects for the different sensitivity of the emission monochromator
for vertically and horizontally polarized light (21). The limiting
fluorescence anisotropies of fluorophores, r_0, were determined by measuring
the anisotropy of a given sample at different solution viscosity,
adjusted by sucrose or glycerol, and extrapolating to viscosity = ∞
using the Perrin equation (22).
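
The anisotropy calculation can be sketched in Python as follows, assuming the standard L-format expression r = (I_VV − G·I_VH)/(I_VV + 2G·I_VH) with G = I_HV/I_HH. The function names and intensities are hypothetical.

```python
def g_factor(i_hv, i_hh):
    """Instrument correction factor G = I_HV / I_HH."""
    return i_hv / i_hh


def anisotropy(i_vv, i_vh, g=1.0):
    """Steady-state anisotropy r = (I_VV - G*I_VH) / (I_VV + 2*G*I_VH)."""
    return (i_vv - g * i_vh) / (i_vv + 2.0 * g * i_vh)


# Hypothetical intensities in arbitrary units.
g = g_factor(i_hv=50.0, i_hh=50.0)
r = anisotropy(i_vv=3.0, i_vh=1.0, g=g)
print(g, r)
```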

Determination of the Average Fluorescence Energy Transfer Efficiency
from CPM on the Small 12-kDa Domains of the DnaB Hexamer to
the Fluorescein Residue Attached at the 5'- or 3'-End of the ssDNA
Oligomers— The efficiency of the fluorescence radiationless energy
transfer, E, from CPM (donor), located on the small 12-kDa domains of
the DnaB protein variant R14C, to the fluorescein (acceptor), located at
the 5'- or 3'-end of dT(pT)₁₉, bound in the DNA binding site of the
helicase, has been determined using two independent methods. The
fluorescence of the donor in the presence of the acceptor, F_DA, is related
to the fluorescence of the same donor, F_D, in the absence of the acceptor
by the equation,

$$F_{DA} = (1 - \nu_D)F_D + F_D\,\nu_D\,(1 - E_D)$$    (Eq. 4)

where ν_D is the fraction of donors in the complex with the acceptor, and
E_D is the average fluorescence energy transfer from donor to acceptor,
determined from the quenching of the donor fluorescence. Thus, the
average transfer efficiency, E_D, obtained from the quenching of the
CPM fluorescence upon binding of the labeled ssDNA oligomer, is
obtained by rearranging Equation 4,

$$E_D = \frac{1}{\nu_D}\left(\frac{F_D - F_{DA}}{F_D}\right)$$    (Eq. 5)

where, in the considered case, F_D and F_DA are the values of the CPM
fluorescence intensity in the absence and presence of bound 5'-Fl-dT(pT)₁₉
or dT(pT)₁₉-Fl-3'. The value of ν_D has been determined using
the binding constants of the 20- and 10-mers for the DnaB helicase
measured in the same solution conditions (13).

In the second independent method, the average fluorescence transfer
efficiency, E_A, has been determined, using a sensitized acceptor fluorescence
by measuring the fluorescence intensity of the acceptor (fluorescein)
excited at 435 nm, where the donor (CPM) predominantly
absorbs, in the absence and presence of R14C-CPM. The fluorescence
intensities of the acceptor in the absence, F_A, and presence, F_AD, of the
donor are defined as follows,

$$F_A = I_0\,\epsilon_A\,C_{AT}\,\phi_F^A$$    (Eq. 6)

and

$$F_{AD} = (1 - \nu_A)F_A + I_0\,\epsilon_A\,\nu_A\,C_{AT}\,\phi_B^A + I_0\,\epsilon_D\,C_{DT}\,\nu_D\,\phi_B^A\,E_A$$    (Eq. 7)

where I_0 is the intensity of incident light, C_AT and C_DT are the total
concentrations of acceptor and donor, ν_A is the fraction of acceptors in
the complex with donors, ε_A and ε_D are the molar absorption coefficients
of acceptor and donor at the excitation wavelength (435 nm), respectively;
φ_F^A and φ_B^A are the quantum yields of the free and bound acceptor;
and E_A is the average transfer efficiency determined by acceptor-sensitized
emission. All quantities in Equations 6 and 7 can be experimentally
determined. For the case considered in this work, the acceptor is
practically completely saturated with the donor, i.e. ν_A = 1. Thus, for ν_A
= 1, dividing Equation 7 by Equation 6 and rearranging provides the
average transfer efficiency as described by the following.

$$E_A = \frac{1}{\nu_D}\left(\frac{\epsilon_A\,C_{AT}}{\epsilon_D\,C_{DT}}\right)\left[\left(\frac{F_{AD}}{F_A}\right)\left(\frac{\phi_F^A}{\phi_B^A}\right) - 1\right]$$    (Eq. 8)



It should be pointed out that the energy transfer efficiencies, E_D and
E_A, are apparent quantities. E_D is a fraction of the photons absent in the
donor emission as a result of the presence of an acceptor, including
transfer to the acceptor and possible nondipolar quenching processes
induced by the presence of the acceptor, and E_A is a fraction of all
photons absorbed by the donor that were transferred to the acceptor.
The true Förster energy transfer efficiency, E, is a fraction of photons
absorbed by the donor and transferred to the acceptor in the absence of
any additional nondipolar quenching resulting from the presence of the
acceptor (22). The value of E is related to the apparent quantities of E_D
and E_A, by the following (23).

$$E = \frac{E_A}{1 - E_D + E_A}$$    (Eq. 9)

Thus, measurements of the transfer efficiency, using both methods, are
not alternatives but parts of the analysis used to obtain the true
efficiency of the fluorescence energy transfer process, E.
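
The way the two apparent efficiencies combine into the true efficiency can be sketched in Python. The sketch assumes the forms E_D = (F_D − F_DA)/(ν_D·F_D), E_A = (ε_A·C_AT)/(ε_D·C_DT·ν_D)·[(F_AD/F_A)(φ_F/φ_B) − 1] for ν_A = 1, and E = E_A/(1 − E_D + E_A). The function names and all numerical values are hypothetical.

```python
def donor_quenching_efficiency(f_d, f_da, nu_d):
    """Apparent efficiency E_D from donor quenching (assumed form of Eq. 5)."""
    return (f_d - f_da) / (nu_d * f_d)


def sensitized_emission_efficiency(f_a, f_ad, eps_a, eps_d, c_at, c_dt,
                                   nu_d, phi_free, phi_bound):
    """Apparent efficiency E_A from sensitized acceptor emission.

    Assumed form of Eq. 8, valid when the acceptor is saturated (nu_A = 1).
    """
    ratio = (f_ad / f_a) * (phi_free / phi_bound)
    return (eps_a * c_at) / (eps_d * c_dt * nu_d) * (ratio - 1.0)


def true_efficiency(e_d, e_a):
    """True Foerster efficiency E = E_A / (1 - E_D + E_A) (assumed form of Eq. 9)."""
    return e_a / (1.0 - e_d + e_a)


# Hypothetical measurements in arbitrary units.
e_d = donor_quenching_efficiency(f_d=100.0, f_da=60.0, nu_d=0.8)
e_a = sensitized_emission_efficiency(f_a=1.0, f_ad=2.6, eps_a=1.0, eps_d=4.0,
                                     c_at=1.0, c_dt=1.0, nu_d=1.0,
                                     phi_free=1.0, phi_bound=1.0)
e_true = true_efficiency(e_d, e_a)
print(e_d, e_a, e_true)
```

When no additional nondipolar quenching is present, E_D equals E_A and the expression returns that common value.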

The fluorescence energy transfer efficiency between donor and acceptor
dipoles is related to the distance, R, separating the dipoles by the
equation,

$$E = \frac{R_0^6}{R_0^6 + R^6}$$    (Eq. 10)

where R_0 = 9790(κ² n⁻⁴ φ_d J)^(1/6) is the so-called Förster critical distance
(in angstroms), the distance at which the transfer efficiency is 50%; κ²
is the orientation factor; φ_d is the donor quantum yield in the absence
of the acceptor; and n is the refractive index of the medium (n = 1.4)
(22). The overlap integral, J, characterizes the resonance between the
donor and acceptor dipoles.

The fluorescence transfer efficiency of chemically identical donor and
acceptor pairs, characterized by the same quantum yields, depends on
the distance between the donor and acceptor, R, and the factor, κ²,
describing the mutual orientation of the donor and acceptor dipoles (22).
Although in the work presented in this paper we are interested in
relative distances between donors and acceptors, evaluation of κ² allowed
us to estimate the effect of the orientation factor on the differences
between the studied donor-acceptor distances. The factor κ² cannot
be experimentally determined; however, the upper (κ²_max) and lower
(κ²_min) limits of κ² can be obtained from the measured limiting anisotropies
of the donor and acceptor and the calculated axial depolarization
factors, using the procedure described by Dale et al. (24). When both
axial depolarization factors are positive, κ²_max and κ²_min can be calculated
from κ²_max = (2/3)(1 + <d^x_D> + <d^x_A> + 3<d^x_D><d^x_A>) and
κ²_min = (2/3)[1 − (1/2)(<d^x_D> + <d^x_A>)], where <d^x_D> and <d^x_A> are
the axial depolarization factors for the donor and acceptor, respectively
(24). The axial depolarization factors have been calculated as square
roots of the ratios of the limiting anisotropies of the donors (CPM on the
DnaB helicase) and acceptors (fluorescein at the 5'- or 3'-end of the
ssDNA oligomers) and their corresponding fundamental anisotropies
(17). For two chemically identical donor-acceptor pairs, characterized
by the same R_0 (the same κ², φ_d, and J), the differences in the transfer
efficiencies, E_1 and E_2, result exclusively from the different distances
between the donor and acceptor, R_1 and R_2. The relative ratio of the two
distances is then defined by using Equation 10 as follows.

$$\frac{R_2}{R_1} = \left[\frac{E_1(1 - E_2)}{E_2(1 - E_1)}\right]^{1/6}$$    (Eq. 11)
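
The distance relationships can be checked numerically with a short Python sketch. It assumes E = R_0⁶/(R_0⁶ + R⁶), the distance ratio R_2/R_1 = [E_1(1 − E_2)/(E_2(1 − E_1))]^(1/6), and the κ² limits of Dale et al. for positive axial depolarization factors as quoted above. The function names and numerical values are hypothetical.

```python
def transfer_efficiency(r, r0):
    """Foerster efficiency at distance r for critical distance r0 (same units)."""
    return r0 ** 6 / (r0 ** 6 + r ** 6)


def distance_ratio(e1, e2):
    """Ratio R2/R1 for two donor-acceptor pairs sharing the same R0."""
    return (e1 * (1.0 - e2) / (e2 * (1.0 - e1))) ** (1.0 / 6.0)


def kappa2_limits(d_donor, d_acceptor):
    """Lower and upper limits of kappa^2 for positive axial depolarization factors."""
    k_max = (2.0 / 3.0) * (1.0 + d_donor + d_acceptor + 3.0 * d_donor * d_acceptor)
    k_min = (2.0 / 3.0) * (1.0 - 0.5 * (d_donor + d_acceptor))
    return k_min, k_max


# Hypothetical distances in angstroms with R0 = 50.
e1 = transfer_efficiency(40.0, 50.0)
e2 = transfer_efficiency(60.0, 50.0)
ratio = distance_ratio(e1, e2)
limits = kappa2_limits(0.0, 0.0)
print(e1, e2, ratio, limits)
```

Because R_0 cancels in the ratio, the relative distance does not require knowledge of κ², which is why the limits are used only to bound the error.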



Determination of Rigorous Thermodynamic Binding Isotherms and
Absolute Stoichiometries of the DnaB Helicase-ssDNA Complexes— In
this work, we followed the binding of the DnaB protein to the ssDNA
oligomers by monitoring the fluorescence increase, ΔF, of ssDNA
etheno-derivatives upon the complex formation. Proteins and nucleic
acids may form complexes characterized by different spectroscopic properties,
particularly when multiple ligand binding processes are studied.
In applying spectroscopic methods to monitor the ligand-macromolecule
interactions, one should not assume strict proportionality between the
observed signal change and the degree of binding unless the existence
of such proportionality has been shown (15). The general method to
obtain thermodynamically rigorous estimates of the average degree of
binding of the protein per ssDNA oligomer, Σν_i, and the free protein
concentration, P_F, has been previously described by us (8, 15, 25).
Briefly, the experimentally observed ΔF has a contribution from each of
the different possible i complexes of the DnaB hexamer with a nucleic
acid. Thus, the observed fluorescence increase is functionally related to
Σν_i by the equation,

$$\Delta F = \sum_i \nu_i\,\Delta F_{i,max}$$    (Eq. 12)

where ΔF_i,max is the molecular parameter characterizing the maximum
fluorescence increase of the nucleic acid with the DnaB protein bound in
complex i. The same value of ΔF, obtained at two different total nucleic
acid concentrations, N_T1 and N_T2, indicates the same physical state of
the nucleic acid, i.e. the degree of binding, Σν_i, and the free DnaB
protein concentration, P_F, must be the same. The values of Σν_i and P_F are
then related to the total protein concentrations, P_T1 and P_T2, and the
total nucleic acid concentrations, N_T1 and N_T2, at the same value of ΔF,
by the following equations,

$$\sum_i \nu_i = \frac{P_{T2} - P_{T1}}{N_{T2} - N_{T1}}$$    (Eq. 13)

$$P_F = P_{Tx} - \left(\sum_i \nu_i\right)N_{Tx}$$    (Eq. 14)

where x = 1 or 2 (12, 20).
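
The two relations follow from mass conservation, P_Tx = P_F + (Σν_i)·N_Tx, written for two titrations at the same ΔF. A minimal Python sketch, with a hypothetical function name and hypothetical concentrations, is given below. It assumes the forms Σν_i = (P_T2 − P_T1)/(N_T2 − N_T1) and P_F = P_T1 − (Σν_i)·N_T1.

```python
def binding_density_and_free_protein(p_t1, p_t2, n_t1, n_t2):
    """Model-independent degree of binding and free protein concentration.

    p_t1, p_t2: total protein concentrations (M) giving the same signal change.
    n_t1, n_t2: the two total nucleic acid concentrations (M), n_t1 != n_t2.
    """
    if n_t1 == n_t2:
        raise ValueError("the two nucleic acid concentrations must differ")
    sum_nu = (p_t2 - p_t1) / (n_t2 - n_t1)
    p_free = p_t1 - sum_nu * n_t1
    return sum_nu, p_free


# Hypothetical titrations constructed with sum_nu = 0.5 and P_F = 2e-7 M.
sum_nu, p_free = binding_density_and_free_protein(
    p_t1=7e-7, p_t2=1.7e-6, n_t1=1e-6, n_t2=3e-6)
print(sum_nu, p_free)
```

Repeating the calculation at a series of ΔF values yields the binding isotherm without assuming any relationship between signal and degree of binding.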

RESULTS



Micrococcal Nuclease Digestion Reveals Large Structural
Heterogeneity within the DNA Binding Site of the E. coli DnaB
Helicase — Quantitative fluorescence titrations and photo-cross-linking
experiments, using ssDNA oligomers, showed
that the DnaB hexamer has a single ssDNA binding site encompassing
20 ± 3 nucleotide residues and located predominantly
on a single subunit (12-14). The first evidence of the
structural heterogeneity within the DNA binding site came
from nuclease digestion-protection studies of the DNA in the
complex with the helicase. In the first set of experiments, the
complex of the DnaB hexamer with the 20-mer dT(pT)₁₉ labeled
at its 5'-end with ³²P in the presence of 1 mM AMP-PNP
was subjected to micrococcal nuclease digestion as a function of
time. The protein was in molar excess over the 20-mer to
ensure complete saturation of the nucleic acid. Fig. 1a shows
the polyacrylamide sequencing gel of dT(pT)₁₉ after digestion
with the nuclease, at different time intervals, in the absence
and presence of the helicase. In the absence of the helicase, in
our solution conditions, the 20-mer was digested within 20 min.
A dramatically different behavior was observed in the presence
of the enzyme. The digestion process was less efficient, indicating
significant protection of the nucleic acid against the nuclease
by the enzyme. Moreover, at prolonged digestion times, a
nucleic acid fragment of 10 or 11 nucleotide residues was
strongly protected by the helicase. At the longest times, this
was the major nucleic acid fragment on the gel, resistant to
further nuclease action (Fig. 1a).

The size of the protected fragment was not dependent upon
the length or type of base of the oligomer bound to the DnaB
protein, indicating that protection against the nuclease digestion
is limited to the nucleic acid bound within the single DNA
binding site of the helicase. Fig. 1b shows polyacrylamide sequencing
gels of dA(pA)₆₉ after digestion with the nuclease, at
different time intervals, and in the absence and presence of the
helicase. As in the case of dT(pT)₁₉, the only predominant
oligomer protected by the helicase in the complex with
dA(pA)₆₉, after prolonged digestion, is a ssDNA fragment, 10 or
11 nucleotide residues long.

These data indicate that, within the total DNA binding site 
of the DnaB helicase, approximately half of the ~20 nucleotide
residues occluded by the helicase are bound differently than 
the remaining half, resulting in the observed nuclease diges- 
tion pattern. Thus, these results indicate that the total DNA 
binding site of the DnaB helicase is built of two structurally 
and possibly functionally different binding subsites (see below). 

Binding of IQ-mer, deA(pcA) 9> to the DnaB Helicase— To de- 
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Fig. 1. a, autoradiogram of the 15% sequencing polyacrylamide gel 
electrophoresis of the 5'-[ 32 P](dT) 20 and DnaB protein-5'-[ 32 P](dT) 20 
complexes, after micrococcal nuclease (MN) digestion in buffer T2 (pH 
8.1, 4 °C) containing 100 mM NaCl, 1 mM CaCl 2 , and 1 mM AMP-PNP. 
The concentration of the protein and the 20-mer are 1 X 10~ 6 M (hex- 
amer) and 5 X 10~ 7 M (oligomer). Oligomers of different lengths are 
included in lane 1 as molecular markers. Lanes 2-6 show the different 



termine whether or not there is a difference in affinities be- 
tween the two subsites of the total DNA binding site of the 
helicase that could result in their different functional roles in 
the enzyme activities, we performed quantitative thermody- 
namic studies of the binding of a ssDNA oligomer containing 
only 10 residues to the DnaB hexamer. These are the partial 
nucleic acid ligands that can only interact with half of the total 
binding site of the enzyme. Fluorescence titrations of deA(peA) 9 
with the DnaB helicase, at five different nucleic acid concen- 
trations, in buffer T2 (pH 8.1, 10 °C) containing 100 mM NaCl 
and 1 mM AMP-PNP, are shown in Fig. 2a. As the nucleic acid 
concentration increases, the same relative fluorescence in- 
crease is reached at higher DnaB protein concentrations. The 
selected nucleic acid concentrations provide the separation of 
binding isotherms up to a relative fluorescence increase of 
—4.1, with the plateau at the maximum relative fluorescence 
increase, AF max = 4.3 ± 0.2. To obtain thermodynamically 
rigorous binding parameters, independent of any assumption 
about the relationship between the observed signal and the 
degree of binding, %v i7 the fluorescence titration curves shown 
in Fig. 2a were analyzed, using the approach outlined under 
"Materials and Methods." Fig. 2b shows the dependence of the 
observed relative fluorescence increase as a function of the 
average number of the DnaB hexamers bound per oligomer. 
The plot is linear, indicating that, in the studied binding den- 
sity range, there is a very similar enhancement of the nucleic 
acid fluorescence upon the binding of the oligomer to the DnaB 
protein. The value of Σν_i could be determined up to ~90% of the
observed signal change. Short extrapolation to the maximum 
value of the fluorescence increase provides the stoichiometry of 
the complex. Thus, the data show that only one 10-mer strongly 
binds to the DnaB hexamer, indicating that the structural 
differences between the subsites, resulting in protection of ~10
nucleotide residues of ssDNA in the total DNA binding site 
against nuclease digestion, are reflected in the large differ- 
ences in the affinities between the two DNA binding subsites. 
The solid lines in Fig. 2a are computer fits of the binding 
isotherms to a single binding site that provide the binding 
constant K = (1.7 ± 0.3) × 10⁷ M⁻¹. Comparison with the
binding constant for the 20-mer, dεA(pεA)₁₉, previously obtained
by us in the same solution conditions (K = 3 × 10⁷ M⁻¹),
shows that the 10-mer binds with an affinity very similar to the 
20-mer (13). Thus, the data indicate that the predominant part 
of the free energy of binding the DnaB helicase to ssDNA comes 
from the interactions of the nucleic acid with the strong binding 
subsite of the enzyme (see "Discussion"). 
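
The shift of the isotherms toward higher protein concentrations can be illustrated with a minimal sketch of a single-site equilibrium. It assumes a 1:1 complex between one DnaB hexamer and one 10-mer, mass conservation for both components, and a signal proportional to the fraction of oligomer bound; the function names are illustrative and the 2 × 10⁻⁶ M protein concentration is an arbitrary choice, not a value from the experiments.

```python
import math

def complex_concentration(k, protein_total, dna_total):
    """Concentration (M) of the 1:1 complex from K = C / ((P_T - C)(N_T - C))."""
    b = protein_total + dna_total + 1.0 / k
    # Smaller root of C^2 - b*C + P_T*N_T = 0 is the physically meaningful one.
    return (b - math.sqrt(b * b - 4.0 * protein_total * dna_total)) / 2.0

def relative_fluorescence_increase(k, protein_total, dna_total, df_max):
    """Delta F = Delta F_max * (fraction of oligomer carrying a bound hexamer)."""
    return df_max * complex_concentration(k, protein_total, dna_total) / dna_total

K = 1.7e7      # M^-1, binding constant quoted in the text
DF_MAX = 4.3   # maximum relative fluorescence increase quoted in the text
PROTEIN = 2.0e-6  # M (hexamer); hypothetical value chosen for illustration

# Oligomer concentrations used in the titrations (M, oligomer).
for dna_total in (4.5e-7, 9.0e-7, 1.9e-6, 3.6e-6, 8.0e-6):
    df = relative_fluorescence_increase(K, PROTEIN, dna_total, DF_MAX)
    print(f"[10-mer] = {dna_total:.1e} M  ->  Delta F = {df:.2f}")
```

At a fixed total protein concentration the computed signal falls as the oligomer concentration rises, which is the behavior described for the family of titration curves.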

Quantitative fluorescence titrations at a very high concentration
of dεA(pεA)₉ (~8 × 10⁻⁶ M (oligomer)) did not show
detectable binding of the additional 10-mer (Fig. 2a). Titrations 
at higher nucleic acid concentrations are very difficult because 
they require very high concentrations of stock solutions of the 
DnaB protein, which are beyond the attainable solubility of the 



digestion times without the DnaB helicase. Lane 2, 5'-[³²P](dT)₂₀, 0 s;
lane 3, 30 s; lane 4, 60 s; lane 5, 300 s; lane 6, 900 s; lane 7, 1800 s. Lanes
8-14 show the complex 5'-[³²P](dT)₂₀-DnaB helicase at different digestion
times. Lane 8, 30 s; lane 9, 60 s; lane 10, 180 s; lane 11, 300 s; lane
12, 600 s; lane 13, 900 s; lane 14, 1800 s. b, autoradiogram of the 15%
sequencing polyacrylamide gel electrophoresis of the 5'-[³²P](dA)₇₀ and
DnaB protein-5'-[³²P](dA)₇₀ complexes, after micrococcal nuclease
digestion in buffer T2 (pH 8.1, 4 °C) containing 100 mM NaCl, 1 mM CaCl₂,
and 1 mM AMP-PNP. The concentrations of the protein and the 70-mer
are 1 × 10⁻⁶ M (hexamer) and 5 × 10⁻⁷ M (oligomer), respectively. Lanes
2-7 show the 5'-[³²P](dA)₇₀ at different digestion times without the
DnaB helicase. Lane 2, 5'-[³²P](dA)₇₀, 0 s; lane 3, 30 s; lane 4, 60 s; lane
5, 300 s; lane 6, 900 s; lane 7, 1800 s. Lanes 8-14 show the complex
5'-[³²P](dA)₇₀-DnaB helicase at different digestion times. Lane 8, 30 s;
lane 9, 60 s; lane 10, 180 s; lane 11, 300 s; lane 12, 600 s; lane 13, 900 s;
lane 14, 1800 s.
Fig. 2. a, fluorescence titrations of dεA(pεA)₉ with the DnaB protein
monitored by the increase of the nucleic acid fluorescence in buffer T2
(pH 8.1, 10 °C) containing 100 mM NaCl and 1 mM AMP-PNP, at five
different nucleic acid concentrations (oligomer): 4.5 × 10⁻⁷ M (□); 9.0 ×
10⁻⁷ M (■); 1.9 × 10⁻⁶ M (○); 3.6 × 10⁻⁶ M (●); 8.0 × 10⁻⁶ M (♦). Solid
lines are computer fits of the single-site binding isotherm,
$\Delta F = \Delta F_{max}(K_1 P_F/(1 + K_1 P_F))$, with intrinsic binding
constant K₁ = 1.7 × 10⁷ M⁻¹ and ΔF_max = 4.3. b, the dependence of the
relative increase of the dεA(pεA)₉ fluorescence upon the average number
of DnaB helicase hexamers bound per oligomer (■). The absolute value of
the average number of DnaB helicase hexamers bound per oligomer, Σν_i,
has been determined using the thermodynamically rigorous approach
described under "Materials and Methods." The solid line is a computer
fit using the single-site binding isotherm
($\Delta F = \Delta F_{max}(K_1 P_F/(1 + K_1 P_F))$;
$\Sigma\nu_i = K_1 P_F/(1 + K_1 P_F)$) with K₁ = 1.7 × 10⁷ M⁻¹ and
ΔF_max = 4.3; P_F is the free dεA(pεA)₉ concentration.



protein. Therefore, we used the analytical centrifugation technique
to assess the affinity of DNA to the second subsite. In
these experiments, we used a 10-mer, dA(pA)₉, labeled at the
5'- or 3'-end with fluorescein (see "Materials and Methods").
This approach allowed us to monitor exclusively the nucleic
acid and the protein-nucleic acid complex without the interference
of the protein and AMP-PNP absorbance. The sedimentation
velocity profiles (monitored at 515 nm) of the DnaB
helicase-dA(pA)₉-3'-Fl mixture at the nucleic acid and helicase
concentrations of 9 × 10⁻⁵ and 2 × 10⁻⁵ M, respectively, in
buffer T2 (pH 8.1, 20 °C) containing 100 mM NaCl and 1 mM
AMP-PNP, are shown in Fig. 3. The sedimentation run was
performed at 60,000 rpm. It is clear that, initially, two independently
moving boundaries exist. The slow moving boundary
has a sedimentation coefficient of s₂₀,w = 1.4 ± 0.2, which is the
s₂₀,w value of the free dA(pA)₉-3'-Fl. The fast moving boundary
contains dA(pA)₉-3'-Fl in the complex with the DnaB helicase.
We have previously shown that the DnaB hexamer fully preserves
its hexameric structure in the complex with the ssDNA
Fig. 3. Absorption profiles at 515 nm of the sedimentation
velocity runs of the dA(pA)₉-3'-Fl-DnaB protein complex in
buffer T2 (pH 8.1, 20 °C) containing 100 mM NaCl and 1 mM
AMP-PNP, at a 4.5:1 molar excess of the nucleic acid over the
enzyme. The concentration of the DnaB hexamer is 2 × 10⁻⁵ M
(hexamer), and the concentration of the dA(pA)₉-3'-Fl is 9 × 10⁻⁵ M
(oligomer). Solid lines are initial scans of the samples, which include slow
and fast moving boundaries of the nucleic acid and the complex, respectively.
Dashed lines are scans of the sample after the fast moving
boundary reached the bottom of the cell. The initial part of the last scan
indicates the location of the base line (time interval was 8 min; 60,000
rpm).

(8, 13). After the fast moving boundary reaches the cell bottom,
only the slow moving boundary of the free dA(pA)₉-3'-Fl still
remains (dashed lines). Notice that during the sedimentation
process, the boundary of the complex migrates in the field of
the constant free 10-mer concentration, [T]_Free >> 1/K₁, thus
assuring that the enzyme always has the strong binding subsite
saturated with the nucleic acid. At 515 nm, one monitors
exclusively the total concentration of the 10-mer. Comparison 
of the contributions of the slow and fast moving boundaries 
with the total absorption of the sample shows that 36% of the 
total nucleic acid concentration migrated in the fast moving 
boundary (Fig. 3). From the known total concentration of the 
DnaB helicase in the sample, the stoichiometry of the complex 
is calculated to be 1.6 ± 0.2, which indicates that at this 10-mer 
concentration we observed significant saturation (60%) of the 
second DNA binding subsite. 

Because we know the free nucleic acid concentration from 
the absorbance of the slowly moving boundary, we can estimate 
the macroscopic ssDNA binding constant for the second sub- 
site. In the considered case, the first binding subsite of the 
helicase is completely saturated with the 10-mer. Therefore, 
the partition function of the system, Z, and the degree of 
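binding to the second subsite, ν₂, are as follows,

$$Z = 1 + K_2[T]_{Free} \qquad \text{(Eq. 15)}$$

$$\nu_2 = \frac{K_2[T]_{Free}}{1 + K_2[T]_{Free}} \qquad \text{(Eq. 16)}$$

and K₂ is defined as follows.

$$K_2 = \frac{\nu_2}{[T]_{Free}(1 - \nu_2)} \qquad \text{(Eq. 17)}$$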



Introducing the values of ν₂ = 0.6 and [T]_Free = 5.8 × 10⁻⁵ M,
obtained from the sedimentation velocity experiments, provides
the value of K₂ = (2.6 ± 1) × 10⁴ M⁻¹. A similar value of
the binding constant K₂ of the dA(pA)₉-3'-Fl and 5'-Fl-dA(pA)₉
for the weak binding subsite has been obtained using lower and
higher concentrations of the nucleic acids. Thus, the data show
that the affinity of ssDNA for the second subsite is ~3 orders of
magnitude lower than the affinity for the strong binding subsite.
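
The arithmetic behind the stoichiometry and the estimate of K₂ can be retraced with a short sketch. It assumes, as stated in the text, that the strong subsite is fully saturated, so that the degree of binding to the second subsite is the total stoichiometry minus one; the function and variable names are illustrative only.

```python
def second_subsite_binding(dna_total, helicase_total, fraction_in_fast_boundary):
    """Return (stoichiometry, free 10-mer, nu2, K2) from the boundary analysis."""
    bound = fraction_in_fast_boundary * dna_total   # 10-mer migrating with the hexamer
    free = dna_total - bound                        # 10-mer in the slow boundary
    stoichiometry = bound / helicase_total          # 10-mers per hexamer
    nu2 = stoichiometry - 1.0                       # strong subsite assumed saturated
    k2 = nu2 / (free * (1.0 - nu2))                 # K2 = nu2 / ([T]_Free (1 - nu2))
    return stoichiometry, free, nu2, k2

def k2_from(nu2, free):
    """K2 computed directly from a degree of binding and a free ligand concentration."""
    return nu2 / (free * (1.0 - nu2))

# Values quoted in the text: 9e-5 M 10-mer, 2e-5 M hexamer, 36% in the fast boundary.
stoichiometry, free, nu2, k2 = second_subsite_binding(9e-5, 2e-5, 0.36)
print(f"stoichiometry = {stoichiometry:.2f}, [T]_Free = {free:.2e} M")
print(f"K2 (unrounded inputs) = {k2:.2e} M^-1")
print(f"K2 (nu2 = 0.6, [T]_Free = 5.8e-5 M) = {k2_from(0.6, 5.8e-5):.2e} M^-1")
```

The unrounded inputs and the rounded values quoted in the text give slightly different numbers, but both fall inside the reported uncertainty of K₂.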

Interactions of ssDNA Oligomers Having Different Lengths 
with the Strong ssDNA Binding Subsite— To obtain further 
insight into the interactions of the DNA in the strong binding 
subsite, we performed quantitative fluorescence titrations of a 
Fig. 4. a, fluorescence titrations of dεA(pεA)₈, dεA(pεA)₇, dεA(pεA)₆,
dεA(pεA)₅, dεA(pεA)₄, and dεA(pεA)₃ with the DnaB protein monitored
by the increase of the nucleic acid fluorescence in buffer T2 (pH 8.1,
10 °C) containing 100 mM NaCl and 1 mM AMP-PNP. Concentrations of
all oligomers are 4.5 × 10⁻⁷ M (oligomer). dεA(pεA)₈; ♦, dεA(pεA)₇;
dεA(pεA)₆; ○, dεA(pεA)₅; +, dεA(pεA)₄; 0, dεA(pεA)₃. For comparison,
the fluorescence titration of dεA(pεA)₉ is also included (■). Solid
lines are computer fits of the single-site binding isotherm,
$\Delta F = \Delta F_{max}(K_1 P_F/(1 + K_1 P_F))$, with binding constants
K₁ and ΔF_max as follows: 1.5 × 10⁷ M⁻¹ and 4.3; 8 × 10⁶ M⁻¹ and 4.3;
3.3 × 10⁶ M⁻¹ and 3.3; and 4 × 10⁵ M⁻¹ and 2.6 for the 9-, 8-, 7-, and
6-mer, respectively. b, the dependence of the natural logarithm of the
binding constant as a function of the length of the oligomer bound to
the strong subsite of the DnaB helicase (■). The binding constants for
5- and 4-mers have an assigned maximum value at 1 × 10⁴ M⁻¹, which is
below the minimum affinity detectable in our fluorescence titrations
(~5 × 10⁴ M⁻¹), although the affinities of these oligomers could be
characterized by even lower binding constants, as indicated by the
error bars.
series of ssDNA oligomers of different lengths. Fluorescence
titrations of dεA(pεA)₈, dεA(pεA)₇, dεA(pεA)₆, dεA(pεA)₅,
dεA(pεA)₄, and dεA(pεA)₃ with the DnaB helicase in buffer T2
(pH 8.1, 10 °C) containing 100 mM NaCl and 1 mM AMP-PNP
are shown in Fig. 4a. For comparison, the fluorescence titration
of the 10-mer, dεA(pεA)₉, with DnaB is also included. With the
decreasing number of residues, the relative maximum fluorescence
changes, and the affinity decreases. In the case of 9- and
8-mers, the maximum fluorescence change, ΔF_max, upon saturation
with the helicase, is still similar to the one determined
for the 10-mer. However, the affinity is lower than the affinity
of the 10-mer and is characterized by binding constants K₉ =
(1.5 ± 0.5) × 10⁷ M⁻¹ and K₈ = (8 ± 2) × 10⁶ M⁻¹, respectively.
A dramatic drop in the affinity and maximum relative fluorescence
increase is observed in the case of the 7- and 6-mer (Fig.
4a; see Table I). No detectable binding to the helicase occurs in
the case of 5- and 4-mers (Fig. 4a). Titrations at very high
concentrations of the 5- and 4-mer could only provide a semiquantitative
estimate of the affinities, because the required
DnaB concentration is beyond the solubility of the protein; however,
these experiments indicate that the binding constants for
the 5- and 4-mers are not higher than 1 × 10⁴ M⁻¹ (data not
shown). Fig. 4b shows the dependence of the natural logarithm
of binding constants of studied oligomers to the DnaB protein
as a function of the number of nucleotide residues in the ssDNA
oligomer. The plot is nonlinear, a clear indication that the
affinity is not a simple function of the length of the nucleic acid.
The difference of 2 residues between the 10-mer and 8-mer
causes only an ~0.3 kcal/mol decrease of the free energy of
interactions. The difference of 2 residues between the 7-mer
and 5-mer decreases the free energy of binding by at least ~3
kcal/mol, practically eliminating the binding of the 5-mer to the
enzyme in studied solution conditions. These data show that, to
efficiently bind to the strong DNA binding subsite, the nucleic
acid must span 6 or 7 residues. Thus, the results indicate a
complex structure of the ssDNA strong binding subsite where
the direct contacts between the helicase and the nucleic acid,
decisive in complex formation, are separated by 6 or 7 nucleotides
(see "Discussion").
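
The free energy differences quoted above follow from the ratio of two binding constants, ΔΔG = RT ln(K_long/K_short). The sketch below assumes the measurement temperature of 10 °C and uses the central values of the binding constants from the text; the helper name is illustrative.

```python
import math

R_KCAL = 1.987e-3   # gas constant, kcal/(mol K)
T_KELVIN = 283.15   # 10 °C, the temperature of the titrations

def delta_delta_g(k_long, k_short, temperature=T_KELVIN):
    """Loss of binding free energy (kcal/mol) on going from the longer to the shorter oligomer."""
    return R_KCAL * temperature * math.log(k_long / k_short)

# 10-mer versus 8-mer, central values of K from the text.
print(f"10-mer -> 8-mer: {delta_delta_g(1.7e7, 8e6):.2f} kcal/mol")
# 7-mer versus 5-mer; 1e4 M^-1 is only an upper limit for the 5-mer,
# so the result is a lower limit for the free energy loss.
print(f"7-mer -> 5-mer: at least {delta_delta_g(3.3e6, 1e4):.2f} kcal/mol")
```

With these inputs the first difference is a few tenths of a kcal/mol and the second exceeds 3 kcal/mol, in line with the contrast drawn in the text; the exact first value depends on where the binding constants fall within their error limits.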

Salt Effect on the Affinity of the DnaB Helicase to ssDNA 
Oligomers— Fluorescence titrations of dεA(pεA)₈ with the
DnaB helicase in buffer T2 (pH 8.1, 10 °C), containing 1 mM 
AMP-PNP and different NaCl concentrations, are shown in 
Fig. 5a. As the salt concentration increases, the isotherms shift 
toward higher total DnaB protein concentrations, indicating a 
decreasing affinity of the protein-nucleic acid complex at higher 
salt concentrations. It should also be noted that ΔF_max is lower
at higher salt concentrations, decreasing from 4.3 ± 0.2 at 100 
mM to 2.8 ± 0.2 at 407 mM [NaCl]. A similar decrease of the 
maximum fluorescence increase upon the helicase binding has 
been observed for all other oligomers (data not shown). 

The dependence of the logarithm of the intrinsic binding 
constants for 10-, 9-, 8-, 7-, and 6-mers upon the logarithm of 
[NaCl] (log-log plot) is shown in Fig. 5b. Within experimental
accuracy, the plots are linear in the studied salt concentration 
ranges, which is different from the nonlinear behavior of the 



Table I

Thermodynamic and fluorescence parameters of ssDNA oligomer binding to the DnaB helicase in buffer T2
(pH 8.1, 100 mM NaCl, 1 mM AMP-PNP, 10 °C; λ_ex = 325 nm, λ_em = 410 nm)

                          dεA(pεA)₉          dεA(pεA)₈          dεA(pεA)₇      dεA(pεA)₆        dεA(pεA)₅
Stoichiometry (n)         1                  1                  1              1
Binding constant K (M⁻¹)  (1.7 ± 0.3) × 10⁷  (1.5 ± 0.5) × 10⁷  (8 ± 3) × 10⁶  (3.3 ± 1) × 10⁶  (4 ± 1) × 10⁵
ΔF_max                    4.3 ± 0.2          4.3 ± 0.2          4.2 ± 0.2      3.3 ± 0.2        2.6 ± 0.4
∂log K/∂log[NaCl]         -1.4 ± 0.3         -1.5 ± 0.3         -1.4 ± 0.3     -1.5 ± 0.3       -1.4 ± 0.3


Fig. 5. a, fluorescence titrations of dεA(pεA)₉ with the DnaB protein
in buffer T2 (pH 8.1, 10 °C) containing 1 mM AMP-PNP, at different
NaCl concentrations as follows. ■, 100 mM; ○, 194 mM; □, 304 mM; ♦,
407 mM. Solid lines are computer fits using the single-site binding
isotherm, $\Delta F = \Delta F_{max}(K_1 P_F/(1 + K_1 P_F))$, with ΔF_max
and K₁ as follows. ■, 4.3 and 1.5 × 10⁷ M⁻¹; ○, 3.7 and 7 × 10⁶ M⁻¹;
□, 3.3 and 3 × 10⁶ M⁻¹; ♦, 2.8 and 1.3 × 10⁶ M⁻¹. b, the dependence of
the intrinsic binding constant K₁ for the binding of ssDNA oligomers of
different lengths to the strong binding subsite of DnaB helicase upon
NaCl concentrations in solution (log-log plots) in buffer T2 (pH 8.1,
10 °C) containing 1 mM AMP-PNP. ○, dεA(pεA)₉; dεA(pεA)₈; ■, dεA(pεA)₇;
△, dεA(pεA)₆; ♦, dεA(pεA)₅.

log-log plot previously determined in the case of the 20-mer,
dεA(pεA)₁₉ (12, 13). With increasing salt concentrations, the
affinities of all oligomers decrease, indicating that the binding
process is accompanied by a net release of ions with the slopes
∂log K/∂log[NaCl] = -1.4 ± 0.4, -1.5 ± 0.3, -1.4 ± 0.4, -1.5 ± 0.4,
and -1.4 ± 0.4 for 10-, 9-, 8-, 7-, and 6-mer, respectively
(27) (Table I). Thus, these data indicate that a similar number
of ~1.5 ions is released upon the complex formation with each
of the oligomers being long enough to provide all essential
contacts with the enzyme in the binding subsite.
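
The slope of a log-log plot of this kind can be estimated with an ordinary least-squares line. The sketch below uses the four NaCl concentrations and fitted binding constants listed in the legend to Fig. 5a; it is an unweighted fit of rounded values written for illustration, not the authors' fitting procedure.

```python
import math

# NaCl concentrations (M) and fitted binding constants (M^-1)
# as listed in the legend to Fig. 5a.
nacl = [0.100, 0.194, 0.304, 0.407]
k1 = [1.5e7, 7e6, 3e6, 1.3e6]

def linear_fit(xs, ys):
    """Unweighted least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

log_salt = [math.log10(c) for c in nacl]
log_k = [math.log10(k) for k in k1]
slope, intercept = linear_fit(log_salt, log_k)
print(f"dlogK/dlog[NaCl] = {slope:.2f}")
```

The fit to these four rounded points gives a negative slope that agrees with the tabulated value within its stated uncertainty; the sign indicates a net release of ions on complex formation.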

Previously, we determined that binding of a 20-mer, 
dεA(pεA)₁₉, which spans the entire total DNA binding site, to
the DnaB helicase is accompanied by the maximum release of
~3.7 ions (13). This number is significantly higher than the
~1.4 obtained for the 10-mer (Table I). This comparison suggests
that the interactions of ssDNA with the weak binding
subsite are accompanied by a net release of ~2 ions. Another
possibility is that interactions between the strong and weak
subsites, simultaneously saturated with nucleic acid in the
complex with dεA(pεA)₁₉, result in the net release of ~2
additional ions. At present, we cannot exclude either of these
possibilities. 

Determination of the Orientation of the E. coli DnaB Helicase
with Respect to the Polarity of the Sugar-Phosphate Backbone of
ssDNA, Using the Fluorescence Energy Transfer Method—Determination
of the mutual orientations of proteins and nucleic
acids in the complex should be based on a method that is 
sensitive to the differences in distances between different, spe- 
cific regions of both macromolecules (17, 22). Fluorescence en- 
ergy transfer between a donor and an acceptor, placed in spe- 
cific locations on a protein and a nucleic acid, provides a very 
sensitive technique to assess the relative proximities between 
different regions of both macromolecules in the complex. The 
orientation of the DnaB helicase, in the complex with ssDNA, 
was determined by using the 20-mer, dT(pT)₁₉, labeled with
fluorescein (acceptor) at its 5'- or 3'-end, respectively, and the
DnaB protein variant, R14C, specifically labeled with a couma- 
rin derivative (donor), CPM, at the small 12-kDa domain of the 
enzyme (see "Materials and Methods"). If the DnaB helicase 
binds predominantly in a single orientation, with respect to the 
polarity of the sugar-phosphate backbone of ssDNA, then dif- 
ferent responses of the donor and acceptor fluorescence should 
be observed, depending on the different location of the acceptor 
on the nucleic acid. 

The elongated DnaB protein monomer is built of two struc- 
tural domains, small 12-kDa and large 33-kDa domains con- 
nected at the "hinge" region (28) as visualized by electron 
microscopy data (10, 11). In the hexamer, all protomers are 
oriented with their small 12-kDa domains in the same direction 
(10, 11). Because the protein does not have natural cysteines, 
we replaced arginine at position 14 from the N terminus of the 
protein located in the small 12-kDa domain of the enzyme with 
a single cysteine residue, using site-directed mutagenesis. Sub- 
sequently, this cysteine residue was specifically modified with 
CPM to provide R14C-CPM (see "Materials and Methods"). The 
selection of the modification site was directed by the fact that 
removal of the entire 14-amino acid fragment from the N ter- 
minus of the protein did not affect, to any extent, the biological 
functions of the protein (28). As a result of modification, the 
R14C DnaB hexamer has six CPM molecules located in the 
small domain of each protomer (R14C-CPM). Thus, 6 CPM 
residues form a ring at one end of the DnaB hexamer. 

The emission spectrum of R14C-CPM strongly overlaps the 
absorption spectrum of the fluorescein. These spectroscopic 
properties of CPM make the marker an excellent fluorescence 
donor for fluorescein (29). The presence of unlabeled dT(pT)₁₉
causes very little change in the fluorescence emission spectra of
R14C-CPM (λ_ex = 435 nm); however, the presence of R14C-CPM
causes an ~2-fold decrease of the fluorescence intensity of
5'-Fl-dT(pT)₁₉ (λ_ex = 485 nm), although with the excitation at
485 nm only fluorescein on the 5'-Fl-dT(pT)₁₉ absorbs light
(data not shown). Saturation of the 20-mer with the unlabeled
DnaB protein causes only an ~8% decrease of the 5'-Fl-dT(pT)₁₉
fluorescence (data not shown). It is evident that, even
in the absence of the energy transfer process, the presence of 6
hydrophobic CPM residues affects the quantum yield of fluorescein
at the 5'-end of the ssDNA, which already suggests
close proximity between the CPMs and fluorescein. The quantum
yield of fluorescein is independent of the excitation wavelength
between 400 and 500 nm (22). Thus, as expected, the
ratio of quantum yields of 5'-Fl-dT(pT)₁₉ in the complex with
R14C-CPM and free in solution is constant and equal to
0.51 over a tested range of excitation wavelengths between 465
and 500 nm. In this spectral range of excitation, no detectable 
fluorescence energy transfer from CPM residues to fluorescein 
occurs. Thus, this ratio of quantum yields, independent of 
excitation wavelength, reflects the change of the emission intensity
of 5'-Fl-dT(pT)₁₉, resulting exclusively from the formation
of the complex with R14C-CPM, in the absence of the
Fig. 6. a, sum of the fluorescence emission spectra (dashed line) of DnaB
R14C-CPM in the presence of unlabeled dT(pT)₁₉ (4.5 × 10⁻⁷ M (oligomer))
and 5'-Fl-dT(pT)₁₉ in the presence of R14C-CPM (without energy
transfer) (λ_ex = 435 nm) in buffer T2 (pH 8.1, 10 °C) containing 100
mM NaCl and 1 mM AMP-PNP and the fluorescence emission spectrum
of the complex of R14C-CPM with 5'-Fl-dT(pT)₁₉ (λ_ex = 435 nm) (solid line)
in the same buffer. Concentrations of 5'-Fl-dT(pT)₁₉ and the protein
were 4.5 × 10⁻⁷ M (oligomer) and 9.6 × 10⁻⁷ M (hexamer), respectively.
The fluorescence emission spectrum of R14C-CPM normalized at 476
nm (peak) to the emission spectrum of the protein in the complex with
5'-Fl-dT(pT)₁₉ (-•-•-) is also included. b, sensitized emission spectrum
of 5'-Fl-dT(pT)₁₉ (λ_ex = 435 nm) in the complex with R14C-CPM,
obtained after subtraction of the normalized spectrum of R14C-CPM
(see Fig. 7a) in buffer T2 (pH 8.1, 10 °C) containing 100 mM NaCl and
1 mM AMP-PNP superimposed on the fluorescence emission spectrum
of 5'-Fl-dT(pT)₁₉ in the presence of R14C-CPM (without energy
transfer), obtained at the same excitation wavelength by multiplying
the spectrum of free, labeled 20-mer by the quantum yield ratio of
0.51. Concentrations of 5'-Fl-dT(pT)₁₉ and R14C-CPM are 4.5 × 10⁻⁷
M (oligomer) and 9.6 × 10⁻⁷ M (hexamer), respectively.



energy transfer process, and can be used to obtain the spectrum 
of 5'-Fl-dT(pT)₁₉, in the presence of R14C-CPM, without the
changes induced by the energy transfer process at any excita- 
tion wavelength (Equations 7 and 8). 

The dashed line in Fig. 6a is the sum of the emission spectra
(λ_ex = 435 nm) of the R14C-CPM and 5'-Fl-dT(pT)₁₉ in the
presence of unlabeled nucleic acid and R14C-CPM (without
energy transfer), respectively, in buffer T2 (pH 8.1, 10 °C)
containing 100 mM NaCl and 1 mM AMP-PNP. The solid line is
the fluorescence emission spectrum of the complex of R14C-CPM
with 5'-Fl-dT(pT)₁₉ at the same concentrations of the
protein and nucleic acid as in the case of the sum of independent
components of the complex. Clearly, there is a dramatic
difference between the sum of the independent donor and acceptor
spectra and the spectrum where both donor and acceptor
are placed in the same complex. The emission intensity of
R14C-CPM at 476 nm in the complex with 5'-Fl-dT(pT)₁₉ is
decreased by ~35%, as compared with the R14C-CPM complexed
with unlabeled dT(pT)₁₉. The decrease of emission at
476 nm, where there is no contribution from fluorescein emission,
indicates significant fluorescence energy transfer from
the CPM residues located on the small 12-kDa domains of the
DnaB hexamer to the fluorescein moiety placed at the 5'-end of
the bound 5'-Fl-dT(pT)₁₉.

Comparison between the sum of the spectra of independent
components of the complex and the spectrum of the complex in
Fig. 6a shows that the fluorescence intensity of the fluorescein
residue of 5'-Fl-dT(pT)₁₉, with the peak at 520 nm, is strongly
increased in the complex with R14C-CPM (λ_ex = 435 nm).
Recalling that fluorescein does not contribute to the CPM emission
band at 476 nm, we can normalize the spectra of R14C-CPM-unlabeled
dT(pT)₁₉ and R14C-CPM-5'-Fl-dT(pT)₁₉ complex
at 476 nm. The difference between the normalized
spectrum of R14C-CPM-unlabeled dT(pT)₁₉ and the spectrum
of the complex R14C-CPM-5'-Fl-dT(pT)₁₉ provides the sensitized
emission spectrum of the 5'-Fl-dT(pT)₁₉ bound to R14C-CPM.
The emission spectrum of 5'-Fl-dT(pT)₁₉ in the complex
with R14C-CPM, without energy transfer, is shown together with
the sensitized emission spectrum of 5'-Fl-dT(pT)₁₉ in Fig. 6b. It is
evident that in the presence of the donor, CPM, the fluorescence
intensity of the fluorescein at the 5'-end of the 20-mer is
increased by ~220%.
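
The normalization and subtraction described above can be sketched on a toy example. The spectra below are synthetic numbers invented for illustration (they are not data from the experiments), and the function name is hypothetical; the only assumption carried over from the text is that the acceptor does not emit at the normalization wavelength of 476 nm.

```python
def sensitized_emission(wavelengths, donor_reference, complex_spectrum, norm_nm=476):
    """Scale the donor-only spectrum to the complex at norm_nm and subtract it.

    Returns (scale, difference spectrum). The scale is the fraction of donor
    emission remaining in the complex, so 1 - scale is the donor quenching.
    """
    i = wavelengths.index(norm_nm)
    scale = complex_spectrum[i] / donor_reference[i]
    difference = [c - scale * d for c, d in zip(complex_spectrum, donor_reference)]
    return scale, difference

# Synthetic toy spectra (arbitrary units), NOT measured data.
wavelengths = [460, 476, 500, 520, 540]
donor_reference = [0.60, 1.00, 0.50, 0.20, 0.10]   # donor with unlabeled DNA
acceptor_part = [0.00, 0.00, 0.30, 1.00, 0.50]     # acceptor emission in the complex
complex_spectrum = [0.65 * d + a for d, a in zip(donor_reference, acceptor_part)]

scale, difference = sensitized_emission(wavelengths, donor_reference, complex_spectrum)
print(f"donor quenching at 476 nm: {100 * (1 - scale):.0f}%")
print("acceptor emission recovered:", [round(v, 3) for v in difference])
```

In this toy case the recovered difference spectrum equals the acceptor contribution that was mixed in, and the scale factor reports the donor quenching, which is the same pair of quantities read from Fig. 6.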

Analogous experiments were performed with a 20-mer,
dT(pT)₁₉-Fl-3', having fluorescein located at the opposite
3'-end of the nucleic acid. Unlike the case of 5'-Fl-dT(pT)₁₉,
formation of the complex with R14C-CPM causes only an ~8%
decrease of the fluorescence of dT(pT)₁₉-Fl-3' (λ_ex = 485 nm),
which is the same as observed in the presence of unlabeled
protein (data not shown). This difference results from the
larger distance between CPM residues on the small 12-kDa
domains of the DnaB hexamer and fluorescein at the 3'-end of
the 20-mer (see below). The dashed line in Fig. 7a is the sum of
the fluorescence emission spectra of independent components
of the complex, R14C-CPM in the presence of unlabeled
dT(pT)₁₉, and the fluorescence emission spectrum of
dT(pT)₁₉-Fl-3' in the presence of R14C-CPM (without energy transfer),
in buffer T2 (pH 8.1, 10 °C) containing 100 mM NaCl and 1 mM
AMP-PNP (λ_ex = 435 nm). The solid line in Fig. 7a is the
fluorescence emission spectrum of the R14C-CPM and
dT(pT)₁₉-Fl-3' complex at the same concentrations of the protein and
nucleic acid as independent components of the complex. Contrary
to the situation with 5'-Fl-dT(pT)₁₉, only a small difference
is observed when both the donor, CPM on the DnaB
protein, and the acceptor, fluorescein on the 3'-end of the
20-mer, are placed in the same complex as compared with the
sum of the spectra of independent components of the complex.
The emission intensity of R14C-CPM is only decreased by
~11% as compared with ~35% observed for R14C-CPM with
5'-Fl-dT(pT)₁₉, indicating a very diminished fluorescence energy
transfer from CPM to the fluorescein moiety, when the
acceptor is located at the 3'-end of the dT(pT)₁₉. Also, the
sensitized emission of the fluorescein located at the 3'-end of
the 20-mer is only increased by ~43% as compared with ~220%
in the complex of R14C-CPM with 5'-Fl-dT(pT)₁₉ (Fig. 6b).

The dramatic difference between the emission spectrum of 
the complex of R14C-CPM with 5'-Fl-dT(pT)₁₉ and the spectrum
of the complex with dT(pT)₁₉-Fl-3' clearly shows that the
helicase binds ssDNA in a predominantly single orientation, 
with respect to the polarity of the ssDNA sugar-phosphate 
backbone. If the helicase could bind ssDNA in two different 
orientations with equal probability, then the changes in the 
spectra of the complexes with the 20-mer, labeled with fluores- 
Fig. 7. a, sum of the fluorescence emission spectra (dashed line) of
R14C-CPM in the presence of unlabeled dT(pT)₁₉ (4.5 × 10⁻⁷ M (oligomer))
and dT(pT)₁₉-Fl-3' in the presence of R14C-CPM (without energy
transfer) (λ_ex = 435 nm) in buffer T2 (pH 8.1, 10 °C) containing 100 mM NaCl
and 1 mM AMP-PNP and the fluorescence emission spectrum of the
complex of R14C-CPM with dT(pT)₁₉-Fl-3' (λ_ex = 435 nm) in the same
solution conditions (solid line). Concentrations of dT(pT)₁₉-Fl-3' and
R14C-CPM are 4.5 × 10⁻⁷ M (oligomer) and 9.6 × 10⁻⁷ M (hexamer),
respectively. The fluorescence emission spectrum of R14C-CPM normalized at
476 nm (peak) to the emission spectrum of R14C-CPM in the complex
with dT(pT)₁₉-Fl-3' (-•-•-) is also included. b, sensitized emission
spectrum of dT(pT)₁₉-Fl-3' (λ_ex = 435 nm) in the complex with
R14C-CPM, obtained after subtraction of the normalized spectrum of
R14C-CPM in buffer T2 (pH 8.1, 10 °C) containing 100 mM NaCl and 1
mM AMP-PNP superimposed on the fluorescence emission spectrum of
dT(pT)₁₉-Fl-3' in the presence of R14C-CPM (without energy transfer),
obtained at the same excitation wavelength. Concentrations of
dT(pT)₁₉-Fl-3' and R14C-CPM are 4.5 × 10⁻⁷ M (oligomer) and 9.6 ×
10⁻⁷ M (hexamer), respectively.

cein at the 5'- or 3' -ends, would be indistinguishable. 

The effect of the location of the fluorescence acceptor on the
observed spectral properties of the studied complexes is reflected
in the large differences in the true energy transfer
efficiencies, E. Using Equations 5 and 8, we obtained the apparent
transfer efficiencies of E_D = 0.77 ± 0.05 and E_A =
0.55 ± 0.03, respectively, for the complex of R14C-CPM with
5'-Fl-dT(pT)₁₉. This difference between E_D and E_A indicates
that fluorescein, at the 5'-end of the bound dT(pT)₁₉, induces
some additional nondipolar CPM fluorescence quenching. The
true Förster fluorescence transfer efficiency from CPM, located
on the small 12-kDa domain, to the fluorescein residue at the
5'-end of dT(pT)₁₉ is then described by Equation 9, which
yields E = 0.71 ± 0.05. Analogous calculations of the fluorescence
energy transfer efficiency in the complex of R14C-CPM
with dT(pT)₁₉-Fl-3' yield E_D = 0.18 and E_A = 0.09 ± 0.01
(Table II). In this case, the true Förster transfer efficiency is
E = 0.1 ± 0.01. The large difference between the true energy



Table II

Fluorescence properties of 5'-Fl-dT(pT)₁₉ and dT(pT)₁₉-Fl-3' in the
complex with the E. coli DnaB helicase and the DnaB helicase variant
modified with CPM (R14C-CPM) in buffer T2 (pH 8.1, 10 °C)
containing 100 mM NaCl and 1 mM AMP-PNP

Property                                  5'-Fl-dT(pT)₁₉        dT(pT)₁₉-Fl-3'
Fluorescence anisotropy (r) (a)           0.28 ± 0.01           0.24 ± 0.01
Limiting fluorescence anisotropy (b)      0.29 ± 0.01           0.25 ± 0.01
Fluorescence energy transfer efficiency   E_D = 0.77 ± 0.04     E_D = 0.18 ± 0.02
in the complex with R14C-CPM (c)          E_A = 0.55 ± 0.04     E_A = 0.09 ± 0.01
                                          E = 0.71 ± 0.04       E = 0.1 ± 0.01

(a) λ_ex = 485 nm.
(b) λ_ex = 485 nm, determined using the Perrin equation (22).
(c) λ_ex = 435 nm.



transfer efficiencies shows that the 5'-end of the 20-mer,
dT(pT)₁₉, is in much closer proximity to the CPM residues,
which are located on the small domains of the DnaB hexamer,
than is the 3'-end of the nucleic acid (see "Discussion").

The determination of exact distances between the donors 
(CPM) and acceptors (fluorescein) is beyond the scope of the 
present discussion on the mutual orientation between the 
DnaB helicase and the ssDNA in the complex. However, using 
Equation 11 we can estimate the approximate ratio of the 
distances of the 5'- and the 3'-end of the dT(pT)₁₉ oligomer
from the center of mass of the CPM donors located on the
small domains of the DnaB hexamer. Introducing E₁ = 0.71
and E₂ = 0.1 into Equation 11, we obtained R₁/R₂ = 0.60. Thus,
the average distance between the donors and the 5'-end of the
20-mer is only 60% of
the distance between the donors and the 3'-end of the nucleic
acid. 
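
The quoted distance ratio can be checked numerically. Equation 11 is given under "Materials and Methods" and is not reproduced in this section; the sketch below assumes it has the standard Förster form for two donor-acceptor pairs sharing the same critical distance R₀, R₁/R₂ = [(1 − E₁)E₂ / (E₁(1 − E₂))]^(1/6). The function name is illustrative.

```python
def forster_distance_ratio(e1, e2):
    """R1/R2 for two donor-acceptor pairs with the same Forster distance R0.

    Follows from E = R0^6 / (R0^6 + R^6), i.e. R = R0 * ((1 - E) / E)**(1/6).
    """
    return (((1.0 - e1) * e2) / (e1 * (1.0 - e2))) ** (1.0 / 6.0)

# Transfer efficiencies for the 5'-labeled (E1) and 3'-labeled (E2) 20-mer.
ratio = forster_distance_ratio(0.71, 0.10)
print(f"R1/R2 = {ratio:.2f}")
```

Under this assumption the two efficiencies reproduce the ratio of about 0.60 given in the text. Because of the sixth-root dependence, a sevenfold difference in efficiency translates into a less than twofold difference in distance.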

Very similar behavior to the one described above has been 
observed when different donor-acceptor pairs have been used. 3 
These results show, for the first time, that the DnaB hexamer 
binds ssDNA in a single orientation, with respect to the sugar- 
phosphate backbone of the nucleic acid. In the complex, the 
small 12-kDa and the large 33-kDa domains of the enzyme face 
the 5'- and 3'-ends of the nucleic acid, respectively. 

DNA Mobility within the Strong and Weak DNA Binding 
Subsite of the DnaB Helicase — Assessment of the relative mo- 
bility of the different segments of the nucleic acid, within the 
DNA binding site, can be obtained by measuring the emission 
anisotropy of the fluorescent markers placed in different loca- 
tions on the nucleic acid. To determine the relative mobility of 
ssDNA in two subsites of the total DNA binding site of the 
DnaB helicase, we determined the emission anisotropy of
5'-Fl-dT(pT)₁₉ and dT(pT)₁₉-Fl-3' in the complex with the helicase.
Anisotropies of both samples are constant across their
emission spectra, indicating the lack of a significant local heterogeneity
around the fluorescent markers (spectra not shown).
However, the anisotropy of 5'-Fl-dT(pT)₁₉, r = 0.28 ± 0.01, is
significantly higher than the anisotropy, r = 0.24 ± 0.01,
determined for dT(pT)₁₉-Fl-3'. Because the fluorescence lifetimes
data not shown), the obtained data indicate significantly higher 
mobility of the nucleic acid at its 3'-end. 

Analogous fluorescence energy transfer and anisotropy studies
with a 10-mer, dA(pA)9, labeled with fluorescein at its 5'- or
3'-end indicate that its 5'-end is located in close proximity
to the 12-kDa domain of the enzyme and has a similar strong
decrease in its mobility (data not shown). As we described
above, this oligomer binds exclusively to the strong subsite in
the DNA binding site of the DnaB helicase. Thus, fluorescence
energy transfer and anisotropy data indicate that the nucleic
acid binds with the first 10 nucleotides from its 5'-end to the
strong DNA binding subsite of the total DNA binding site of the
helicase.

3 Jezewska, M. J., Rajendran, S., Bujalowska, and Bujalowski, W.
(1998) J. Biol. Chem. 273, in press.

DISCUSSION 

The Total DNA Binding Site of a Helicase— Helicases play a 
key role in all aspects of DNA metabolism, and this role is 
related to the interactions of the enzyme with ssDNA and 
dsDNA controlled by binding and hydrolysis of a nucleoside 
triphosphate, e.g. ATP (30). Understanding the functional and 
structural aspects of the DNA binding site is a prerequisite for 
our understanding of how the enzymes perform their functions. 
Yet, little is known about the structure of the DNA binding site 
of any hexameric helicase and the functional interrelations 
within the binding site. In this work, we provide the first 
insight into the complex structure/function relationship of the 
DNA binding site of a hexameric replicative helicase, the E. coli 
DnaB protein. 

Our previous studies with polymer ssDNA and ssDNA oli- 
gomers showed that in a stationary complex with the ATP- 
nonhydrolyzable analog, AMP-PNP, the enzyme has a single 
binding site located on a single subunit of the hexamer (12-14). 
Additionally, this single binding site is used when the enzyme 
binds to the DNA substrates resembling the replication fork (8, 
13, 25, 26). These results indicate that the observed single 
binding site is, in fact, the total DNA binding site of the enzyme 
that, in functional complexes on the junction between ssDNA 
and dsDNA with the replication fork, encompasses both single- 
and double-stranded conformations of nucleic acid over a
stretch of ~20 nucleotide residues.

The operational definition of the total binding site of the 
enzymes, which perform their catalysis on polymer lattices, 
such as helicases, should refer to the complex of the enzymes 
with a polymer substrate. A total binding site of an enzyme is 
used as a single entity that interacts with a continuous stretch 
of polymer substrate. This continuous fragment of the polymer 
substrate (DNA), within the total binding site, defines the site 
size of the enzyme-nucleic acid complex. The total binding site 
can be heterogeneous, i.e. built of functionally and/or structur- 
ally different areas, subsites, specific for the catalytic functions 
of the enzyme. However, the location of the subsites is sequen- 
tial, i.e. they are placed along the polymer substrate. The total 
binding site can perform the dominant catalytic process char- 
acteristic for the enzyme, e.g., unwinding of the duplex DNA. 
Such a binding site can be located on a single subunit of an 
oligomeric enzyme, such as the DnaB helicase; thus, there may 
be several total binding sites, but only one site (one subunit) at 
a time is engaged in interactions with DNA during the cataly- 
sis. A total binding site can include several subunits of an 
oligomeric enzyme, as in the case of DNA-dependent oligomeric 
polymerases. 

Contrary to the total binding site of the enzyme, a subsite 
always interacts with a polymer DNA within the context of a 
total binding site. A subsite cannot be used as an independent 
entity in the interactions of the enzyme with polymer DNA; nor 
can it independently perform the catalysis. 

The Total Binding Site of the DnaB Helicase Is Structurally 
Heterogeneous — Nuclease digestion protection studies provide 
a clear indication of the structural heterogeneity of the total 
binding site of the E. coli DnaB helicase. Only 10 or 11
nucleotide residues, within the total binding site, are strongly
protected from digestion, while the remaining 9 or 10 residues
are accessible to the nuclease (Fig. 1, a and b). These results
indicate that the total binding site of the helicase, which
occludes ~20 nucleotide residues in the complex with polymer
ssDNA, is built of two binding subsites each encompassing a
similar number of ~10 nucleotides. Experiments on the binding
of partial DNA ligands to the helicase showed a large
difference in the affinities between the subsites and indicated
that the 5'-end of the nucleic acid interacts with the strong
binding subsite of the total binding site of the enzyme. The fact
that the nuclease can access ~half of the total number of
occluded residues within the entire binding site suggests not
only a difference in the affinities between the subsites but also
an open architecture of the hexamer at the subsite that
encompasses the 3'-end of the nucleic acid (see below).

The Two DNA Binding Subsites of the Total Binding Site of 
the DnaB Helicase Have Dramatically Different Affinities for 
ssDNA— Direct evidence of large differences in the affinities 
between the DNA binding subsites of the DnaB helicase comes 
from the studies of the binding of a partial ligand, dεA(pεA)9, to
the enzyme. Using the thermodynamically rigorous method, we
determined that only one 10-mer binds with significant affinity
to the helicase and that the association is characterized by the
binding constant K1 = (1.7 ± 0.3) × 10^7 M^-1. The affinity for
the second binding subsite is characterized by
K2 ≈ (2.6 ± 1) × 10^4 M^-1; thus, it is ~3 orders of magnitude
lower. It is evident that the major part of the free energy of
binding of the helicase to ssDNA comes from interactions with
the strong binding subsite. The very low affinity of the weak
binding subsite indicates that the protein does not form
efficient contacts with a single-stranded nucleic acid and
suggests that this subsite of the helicase is not functionally a
ssDNA binding site but rather that it fulfills a different role
when the enzyme is in the complex with its physiological
substrate, the replication fork (see below).
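To put the two binding constants on a common scale, the following sketch converts them into an affinity ratio and a free energy difference using ΔG = -RT ln K. The temperature of 293.15 K is an assumption made only for illustration, since the solution conditions are not restated in this section, and the function name is hypothetical.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)


def binding_free_energy(k_assoc, temperature_k):
    """Standard free energy of binding (kcal/mol) from an association constant in M^-1."""
    return -R_KCAL * temperature_k * math.log(k_assoc)


K1 = 1.7e7   # strong subsite, M^-1 (from the text)
K2 = 2.6e4   # weak subsite, M^-1 (from the text)
T = 293.15   # assumed temperature in kelvin, not taken from the paper

affinity_ratio = K1 / K2
orders_of_magnitude = math.log10(affinity_ratio)
delta_delta_g = binding_free_energy(K1, T) - binding_free_energy(K2, T)

print(f"K1/K2 = {affinity_ratio:.0f} ({orders_of_magnitude:.1f} orders of magnitude)")
print(f"free energy difference = {delta_delta_g:.1f} kcal/mol at {T} K")
```

Under these assumptions the strong subsite is favored by roughly 3.8 kcal/mol, consistent with the statement that it supplies the major part of the binding free energy.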

To efficiently form a complex with the strong DNA binding 
subsite, the nucleic acid must have a length of at least 6 or 7 
nucleotide residues. No detectable affinities were observed 
with ssDNA oligomers shorter than 6 nucleotides in our solu- 
tion conditions (Fig. 4a). It is interesting that the difference of 
2 residues between 7- and 5-mer practically abolishes the af- 
finity of the shorter oligomer for the binding site, while the 
same difference between the 8- and 10-mer leads to a decrease 
of the free energy of binding by only ~0.3 kcal/mol (Fig. 4b). A 
common misconception in studying protein-nucleic acid inter- 
actions is treating both a nucleic acid and a protein as inter- 
acting regular lattices. The difference between the free energy 
of interaction of oligomers of different lengths with the protein 
is then assigned to the difference in statistical effects between 
different oligomers, which usually has very poor quantitative 
justification. We point out that the nucleic acid is the only 
macromolecule that can be approximated by a regular lattice. 
The binding site on the protein can have a very complex struc- 
ture, with distant regions making key contacts with the nucleic 
acid, hardly resembling a regular lattice. The differences be- 
tween different oligomers in binding to the DnaB helicase 
cannot be explained by any difference in the statistical effect 
between the oligomers. This is particularly true for oligomers 
shorter than 6 or 7 residues. Rather, the results suggest that 
the elements of the strong binding subsite of the enzyme, which 
makes crucial binding contacts with the nucleic acid, are sep- 
arated by a distance spanned by 6 or 7 nucleotides. The proper 
complex is formed only when all essential contacts are engaged 
in interactions with ssDNA. In this context, the similar number
of ions released in the interactions of a 10-mer and a 6-mer
with the helicase (~1.5) would be a result of the fact that a
6-mer can still form all essential contacts with the enzyme,
although the oligomer constitutes only 60% of the length of the
10-mer (Fig. 5b).

Direct Proof That the DnaB Helicase Binds in a Single Ori- 
entation with Respect to the Sugar-Phosphate Backbone of a 
ssDNA. Most of the studied helicases show preferential
direction in the unwinding of dsDNA, i.e. in the 5' -> 3' or the
3' -> 5' direction (30). Therefore, it is often a priori assumed
that the enzyme binds in strict polarity, 5' -> 3' or 3' -> 5',
with respect to the orientation of the single-stranded nucleic
acid strand.
This is a natural assumption that simplifies current models, 
based on the still limited solution data, of how the helicase 
functions at the replication fork. However, one can argue that 
the enzyme can bind in both orientations to the nucleic acid 
lattice and that the proper orientation is imposed by specific 
interactions with dsDNA and/or multiple proteins that are 
building the machinery of the replication fork. In this context, 
it should be noted that several specific protein-protein interac- 
tions between the DnaB helicase and proteins, which are part 
of the primosome or replication fork complex, have been iden- 
tified. Although recent electron microscopy and crystallo- 
graphic data show polarity in a helicase binding to ssDNA (31, 
32), the polarity in the binding of a helicase, with respect to the 
directionality of a ssDNA strand, has never been directly 
shown for any hexameric helicase in solution. 

As we pointed out, the determination of the mutual orienta- 
tion of the protein and nucleic acid in a complex should be 
based on the method that is sensitive to the differences in 
distances between different specific regions of both macromol- 
ecules. The fluorescence energy transfer technique is such a 
method. The difference in the effect of the location of the
acceptor, fluorescein, at the 5'- or 3'-end of the 20-mer,
dT(pT)19, on the fluorescence spectra of the complex of nucleic
acid with R14C-CPM (excited in a predominantly donor absorption
band), is dramatic (Figs. 6 and 7). These dramatic spectral
differences are reflected in the large differences between the
energy transfer efficiencies from CPM in the small 12-kDa
domains, all located at one end of the DnaB hexamer, and the
fluorescein placed at the 5'- or 3'-end of the dT(pT)19, which
spans the entire DNA binding site. The efficiency, E, for the
fluorescein placed at the 5'-end of the 20-mer is 0.71 ± 0.04.
The efficiency of the same acceptor located at the 3'-end of the
nucleic acid is only 0.10 ± 0.01. In the case of chemically
identical donor-acceptor pairs, the transfer efficiency depends
on two variable factors characteristic for the studied system,
the distance between the donor and acceptor, R, and the
orientation parameter, κ², which characterizes the mutual
orientation of the donor emission dipole and acceptor absorption
dipole (22). The value of κ² can theoretically assume any value
between 0 and 4, but only these two extreme values would
significantly affect the determined transfer efficiency. The
possible range of κ² can be estimated by using the standard
procedure (Ref. 24; Table II). The obtained ranges of κ² are very
similar for both 5'-Fl-dT(pT)19 and dT(pT)19-Fl-3' and away
from the extreme values of 0 and 4 (Table II). Another equally
rigorous procedure is to perform experiments with several dif- 
ferent donor-acceptor pairs that, due to different structures of 
different chromophores, provide the necessary "randomization" 
of the orientations of emission and absorption dipoles (33). 
Using different donor-acceptor pairs, we obtained similar, very 
large differences between the fluorescence energy transfer ef- 
ficiencies from the fluorophore on the 12-kDa domain of the 
DnaB helicase and the donor or acceptor placed at the 5'- and 
the 3'-end of the bound 20-mer (data not shown). The results 
clearly show that the large difference between the transfer 
efficiencies results from the large difference in the distances 
between the 5'-end and the 3'-end of the 20-mer and the CPM 
located on the small domain of the DnaB protomers. 

Our data show that the DnaB helicase binds ssDNA in a
predominantly single orientation, with respect to the polarity of
the single-stranded nucleic acid lattice. Moreover, the data
show that, in the complex with ssDNA, the small domain of the
protein is in close proximity to the 5'-end of the nucleic acid,
while the large domain is located near the 3'-end of the bound
ssDNA.

Fig. 8. Schematic representation of the mutual orientation of
the small 12-kDa and large 33-kDa domains of the single DnaB
helicase protomer and the DNA binding subsites with respect to
the polarity of the ssDNA arms and duplex DNA in the complex
of the enzyme with replication fork, based on the results
obtained in this work. The helicase is preferentially bound to
the 5'-arm of the replication fork using a single, total binding
site of one of the six protomers shown in the figure. The
locations of the DNA binding subsites within the total binding
site are sequential. The weak binding subsite on the large
33-kDa domain faces the duplex part of the fork and constitutes
the entry site for the dsDNA. The strong binding subsite is in
the vicinity of the small 12-kDa domain and is engaged in
interactions with the ssDNA at the 5'-end of the arm. The 3'-arm
is not forming a stable complex with the helicase hexamer
associated with the 5'-arm of the fork (26).

Sequential Locations of the Strong and Weak Binding Sub- 
sites of the DnaB Helicase— As determined in this work, the 
partial ligand, dεA(pεA)9, binds with overwhelming preference
to the strong binding subsite of the DnaB helicase. The transfer
efficiency between the 5'-end of the 10-mer, dA(pA)9, labeled
with fluorescein, and CPM, located in the small domain of the
DnaB protein, is very similar to the transfer efficiency between
the CPM and fluorescein at the 5'-end of the 20-mer, dT(pT)19
(data not shown). These results indicate that the 5'-ends of
both the 10- and 20-mers are at a similar distance from the 
small domain of the protein. Thus, the strong ssDNA binding 
subsite encompasses the bound 20-mer at its 5'-end, which is in 
close proximity to the small 12-kDa domain of the protein, 
while the weak binding subsite is located entirely on the large 
domain. At present, it is unknown whether or not the small 
domain or the hinge region of the DnaB protomer are directly 
involved in interactions with ssDNA. It should be noted that 
the isolated large 33-kDa domain of the enzyme could still bind 
ssDNA with some affinity, although quantitative analysis of 
the binding has not been performed (28). Thus, it is possible 
that the small domain and the hinge region constitute a part of 
the strong ssDNA binding subsite of the intact DnaB helicase. 

The DnaB helicase binds preferentially to the 5'-arm of the 
replication fork (26). Because the enzyme binds in a single 
orientation, with respect to the polarity of the ssDNA sugar- 
phosphate backbone, with the small domain facing the 5'-end 
of the ssDNA, it is evident that in the complex with the
replication fork, the helicase hexamer is oriented with the large
domains of the protomers toward the duplex part of the fork,
while the 5'-end of the arm of the replication fork is located in
the vicinity of the small 12-kDa domains of the protomers.
Anisotropy of the probe located at the 5'-end of the 20-mer
bound to the helicase is significantly higher than the
anisotropy of the same fluorescein residue located at the 3'-end
of the nucleic acid (Table II). A significant decrease of the
anisotropy, when the probe is located at the 3'-end of the bound
nucleic
acid, indicates an increased mobility of the nucleic acid in the 
weak binding subsite and is most probably due to the lack of 
strong contacts between the single-stranded nucleic acid and 
the binding site. Recall that the micrococcal nuclease can ac- 
cess the part of the nucleic acid in the weak subsite of the total 
binding site of the DnaB helicase, suggesting a more open 
structure of the total binding site at the 3'-end of the bound 
20-mer. 

The results described in this work provide an insight into the 
complex structure-function relationship within the DNA bind- 
ing site of a replicative hexameric helicase. A model of the 
single, total DNA binding site on the DnaB protomer, engaged 
in the complex with the replication fork and based on the data 
presented in this work, is schematically shown in Fig. 8. The 
total DNA binding site of the enzyme is built of two subsites 
placed sequentially along the DNA substrate in the protein- 
nucleic acid complex. The strong ssDNA binding subsite occludes
the 5'-end of the ssDNA, is located in close proximity to
the small 12-kDa domain, and is distant from the duplex part 
of the fork. Binding of ssDNA to this subsite leads to the 
significant immobilization of the nucleic acid and provides the 
major part of the binding free energy. The subsite, which is 
located at the 3'-end of the ssDNA, binds the single-stranded 
nucleic acid very weakly. The single orientation of the helicase 
in the complex with ssDNA indicates that, when the enzyme 
approaches the replication fork, it faces the duplex part of the 
fork with the weak binding subsite located entirely on the large 
33-kDa domain of the protein. Thus, the weak binding subsite 
constitutes the entry site for the dsDNA in the fork. The more 
open architecture of this subsite provides a large space, which 
is necessary for the incoming duplex DNA. 

Comparison with other hexameric helicases is difficult be- 
cause, at this time, no analogous data on the structure of their 
nucleic acid binding sites are available. However, it is possible 
that similar functional and structural relationships within the 
DNA binding site are general for all other hexameric helicases. 
BLAST 2 SEQUENCES RESULTS VERSION BLASTP 2.2.3 [Apr-24-2002]

Matrix: BLOSUM62   gap open: 11   gap extension: 1
x_dropoff: 50   expect: 10   wordsize: 3   Filter

Sequence 1  lcl|9306aa.txt  Length 546 (1 .. 546)
Sequence 2  gi 2661771 helicase [Rhodothermus marinus]  Length 945 (1 .. 945)

NOTE: The statistics (bitscore and expect value) is calculated based on the size of nr database

Score = 350 bits (899), Expect = 1e-95
Identities = 186/407 (45%), Positives = 270/407 (65%), Gaps = 8/407 (1%)

Query: 36  RRNTRSTAKSKVQPVNDYGRIQPQAPELEEAVLGALMIEKDAYSLVSEILRPESFYEHRH 95
           RR TR+   +  Q     GR+ PQA ELE+AVLGA++IE +A     EIL PE+FY+ RH
Sbjct: 29  RRRTRAQIHALHQQA---GRVPPQAVELEQAVLGAMLIEPEAIPRALEILTPEAFYDGRH 85

Query: 96  QLIYAAITDLAVNQKPVDILTVKEQLSKRGELEEVGGPFYITQLSSKVASSAHIEYHARI 155
           Q I+ AI  L    + VD+LTV E+L + GELE+ G   Y+++L+++VAS+A++EYHARI
Sbjct: 86  QRIFRAIVRLFEQNRGVDLLTVTEELRRTGELEQAGDTIYLSELTTRVASAANVEYHARI 145

Query: 156 IAQKYLARELITFTSNIQSKAFDETLDVDDLMQEAEGKLFEISQRNMKKDYTQINPIIAE 215
           IA+K L R +I   + +  +A+D   D  +L+ E E ++F +S  +++K    +N ++ E
Sbjct: 146 IAEKALLRRMIEVMTLLVGRAYDPAADAFELLDEVEAEIFRLSDVHLRKAARSMNEVVKE 205

Query: 216 AYEQIQKAAARTDGLSGLESGYTKLDKMTSGWQKSDLIIIAARPAMGKTAFVLSMAKNIA 275
             E+++    R  G++G+ SG+ +LD +T GWQ+ DLIIIAARP+MGKTAF LS A+N A
Sbjct: 206 TLERLEAIHGRPGGITGVPSGFHQLDALTGGWQRGDLIIIAARPSMGKTAFALSCARNAA 265

Query: 276 V--NFRNPVALFSLEMSNVQLVNRLISNVCEIPSEKIKSGQLAAYEWQQXXXXXXXXXXX 333
           +  ++   VA+FSLEM   QL  RL++    + ++  ++G+L   +W++
Sbjct: 266 LHPHYGTGVAIFSLEMGAEQLAQRLLTAEARVDAQAARTGRLRDEDWRKLARAAGRLSDA 325

Query: 334 XXXVDDTPSLSVFELRTKARRLVREHGVRIIIIDYLQLMNASGM-AFGSRQEEVSTISRS 392
              +DDTPSL V ELR K RRL  EH + ++I+DYLQLM AS M    +R++E++ ISRS
Sbjct: 326 PIFIDDTPSLGVLELRAKCRRLKAEHDIGLVIVDYLQLMQASHMPRNANREQEIAQISRS 385

Query: 393 LKGLAKELNIPIIALSQLNRGVESREGLEGKRPQLSDLRESGAIEQD 439
           LK LAKELN+P++ALSQL+R VE+R G   KRPQLSDLRESG +  D
Sbjct: 386 LKALAKELNVPVVALSQLSRAVETRGG--DKRPQLSDLRESGCLAGD 430

Score = 76.6 bits (187), Expect = 5e-13
Identities = 44/103 (42%), Positives = 59/103 (56%), Gaps = 5/103 (4%)

Query: 428 SDLRESGAIEQDADMVCFIHRPEYYKIFQDDKGNDLRGMAEIIIAKHRNGAVGDVLLRFK 487
           +D+    +IEQDAD+V FI+RPE Y I  D+ GN   G+AEIII K RNG  G V L F
Sbjct: 847 NDIIAHNSIEQDADVVLFIYRPERYGITVDENGNPTEGIAEIIIGKQRNGPTGTVRLAFI 906

Query: 488 GEYTRFQNPDDDMVIPLPDAGAMLGSRMNNTGTVPPPPAEFAP 530
            +Y RF+N    + +  P+ G  L    + T  +P  P + AP
Sbjct: 907 NQYARFEN----LTMYQPEPGTPLPETPDET-ILPSGPPDEAP 944

CPU time: 0.08 user secs. 0.03 sys. secs 0.12 total secs.

Lambda     K      H
   0.317   0.134  0.378

Gapped
Lambda     K      H
   0.267   0.0410 0.140

Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Hits to DB: 4104
Number of Sequences: 0
Number of extensions: 321
Number of successful extensions: 5
Number of sequences better than 10.0: 1
Number of HSP's better than 10.0 without gapping: 1
Number of HSP's successfully gapped in prelim test: 0
Number of HSP's that attempted gapping in prelim test
Number of HSP's gapped (non-prelim): 2
length of query: 546
length of database: 181,542,687
effective HSP length: 124
effective length of query: 422
effective length of database: 140,313,307
effective search space: 59212215554
effective search space used: 59212215554
T: 9
A: 40
X1: 16 ( 7.3 bits)
X2: 129 (49.7 bits)
X3: 129 (49.7 bits)
S1: 41 (21.7 bits)
S2: 73 (32.7 bits)
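The summary statistics in the report above are linked by the Karlin-Altschul relations: the bit score is S' = (λS - ln K) / ln 2, and the expect value is E = (effective search space) × 2^(-S'). The sketch below recomputes them from the printed gapped λ and K; because those parameters are rounded in the report, the results agree only approximately, and the function names are illustrative.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, search_space):
    """Expected number of chance hits with at least this bit score."""
    return search_space * 2.0 ** (-bits)


# Values copied from the report (gapped Lambda and K)
LAMBDA_GAPPED, K_GAPPED = 0.267, 0.0410
query_length, hsp_length = 546, 124
effective_db_length = 140_313_307

effective_query_length = query_length - hsp_length
search_space = effective_query_length * effective_db_length

bits_hsp1 = bit_score(899, LAMBDA_GAPPED, K_GAPPED)
bits_hsp2 = bit_score(187, LAMBDA_GAPPED, K_GAPPED)

print(search_space)  # 59212215554, the reported effective search space
print(f"HSP 1: {bits_hsp1:.1f} bits, E = {expect_value(bits_hsp1, search_space):.1e}")
print(f"HSP 2: {bits_hsp2:.1f} bits, E = {expect_value(bits_hsp2, search_space):.1e}")
```

The second alignment reproduces the reported 76.6 bits and an expect value close to 5e-13; the first comes out slightly above the reported 350 bits, which is within what the three-digit rounding of λ allows.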
Sequence alignments unambiguously distinguish between
protein pairs of similar and non-similar structure when
the pairwise sequence identity is high (>40% for long
alignments). The signal gets blurred in the twilight zone of
20-35% sequence identity. Here, more than a million
sequence alignments were analysed between protein pairs
of known structures to re-define a line distinguishing
between true and false positives for low levels of similarity.
Four results stood out. (i) The transition from the safe zone
of sequence alignment into the twilight zone is described by
an explosion of false negatives. More than 95% of all pairs
detected in the twilight zone had different structures. More
precisely, above a cut-off roughly corresponding to 30%
sequence identity, 90% of the pairs were homologous;
below 25% less than 10% were. (ii) Whether or not
sequence homology implied structural identity depended
crucially on the alignment length. For example, if 10
residues were similar in an alignment of length 16 (>60%),
structural similarity could not be inferred. (iii) The 'more
similar than identical' rule (discarding all pairs for which
percentage similarity was lower than percentage identity)
reduced false positives significantly. (iv) Using intermediate
sequences for finding links between more distant families
was almost as successful: pairs were predicted to be
homologous when the respective sequence families had
proteins in common. All findings are applicable to
automatic database searches.

Keywords: alignment quality analysis/evolutionary conservation/ 
genome analysis/protein sequence alignment/sequence space 
hopping 



Introduction 

Protein sequence alignments in twilight zone 
Protein sequences fold into unique three-dimensional (3D)
structures. However, proteins with similar sequences adopt
similar structures (Zuckerkandl and Pauling, 1965; Doolittle,
1981; Doolittle, 1986; Chothia and Lesk, 1986). Indeed, most
protein pairs with more than 30 out of 100 identical residues
were found to be structurally similar (Sander and Schneider,
1991). This high robustness of structures with respect to
residue exchanges explains partly the robustness of organisms
with respect to gene-replication errors, and it allows for
the variety in evolution (Zuckerkandl and Pauling, 1965;
Zuckerkandl, 1976; Doolittle, 1979, 1986). Structure
alignments have uncovered homologous protein pairs with less than
10% pairwise sequence identity (Valencia et al., 1991; Holmes
et al., 1993; Holm and Sander, 1996; Brenner et al., 1996;
Hubbard et al., 1997). Indeed, most similar protein structure
pairs appear to have less than 12% pairwise sequence identity
(Rost, 1997). Furthermore, the average sequence identity
between all pairs of similar structures is supposedly 8-10%,
and the observed distribution (Gaussian peaking around 8%
identity) marks another region, the midnight zone (Rost, 1997).
The midnight zone is populated by protein structure pairs that
may have become similar by convergent or divergent evolution
(Doolittle, 1994; Rost, 1997). Threading algorithms ultimately
aim at revealing homologous pairs from the midnight zone
(Wodak and Rooman, 1993; Bryant and Altschul, 1995; Sippl,
1995; Rost and Sander, 1996; Sippl and Floeckner, 1996;
Fischer et al., 1996; Rost and O'Donoghue, 1997).
Conventional sequence alignment methods become problematic at
much higher values of sequence identity. Methods often fail
to correctly align protein pairs with 20-30% pairwise sequence
identity. Hence, Doolittle (1986) coined the term twilight zone
for sequence alignments in this region. Do the difficulties
of alignment methods in this zone reflect merely technical
difficulties (statistical significance of detection), or is the
twilight zone defined by a particular feature of evolution?
Length-dependent cut-off for significant sequence identity 
Pairwise sequence identity (percentage of residues identical
between two proteins) is not sufficient to define the twilight
zone. Instead, analysing the relatively small number of structure
pairs available in 1990, Sander and Schneider (1991) defined
a length-dependent threshold for significant sequence identity.
The threshold curve defined (dubbed HSSP-curve) was roughly
proportional to the inverse square-root of the length for
alignments between 7 and 80 residues, and was clipped to
saturate at 25% sequence identity over more than 80 residues.
In 1990, no pair with more than 30 identical residues of 100
aligned had different structures (Sander and Schneider, 1991).
Was this still true for the five times larger PDB (Bernstein
et al., 1977) of 1997?
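A length-dependent threshold of this kind is easy to express in code. The sketch below is an assumption, not a transcription of eqn 1 (which is not reproduced in this section): it uses the commonly quoted form of the Sander and Schneider curve, 290.15 × L^(-0.562) percent identity up to 80 aligned residues and 24.8% beyond, which matches the description of a roughly inverse-square-root decay clipped near 25%. The function names are illustrative.

```python
def hssp_threshold(length):
    """Percent identity needed at a given alignment length.

    Assumed constants (commonly quoted for the HSSP curve, not taken from this text):
    290.15 * L**-0.562 for L <= 80, and a plateau of 24.8 for L > 80.
    """
    if length <= 0:
        raise ValueError("alignment length must be positive")
    if length > 80:
        return 24.8
    # Very short alignments would need more than 100% identity; cap at 100.
    return min(100.0, 290.15 * length ** -0.562)


def above_curve(n_identical, length):
    """True if an alignment lies above the assumed threshold curve."""
    percent_identity = 100.0 * n_identical / length
    return percent_identity > hssp_threshold(length)


for n_identical, length in [(30, 100), (20, 100), (12, 40)]:
    print(n_identical, length, round(hssp_threshold(length), 1),
          above_curve(n_identical, length))
```

With these constants the curve reaches the plateau almost continuously at 80 residues, and alignments shorter than about 7 residues can never lie above it, in line with the range given in the text.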

Hopping in sequence space 

If we could plot the space of protein sequences, would we
observe the protein families as islands? Unfortunately, we
cannot tell. Nevertheless, useful information has been extracted
from sequence (Casari et al., 1995) and structure (Maiorov
and Crippen, 1995) space. In everyday database searches,
protein families are widened by exploiting the transitivity of
homology (Pearson, 1996): (i) a query sequence U is aligned
to a database, say SWISS-PROT (Bairoch and Apweiler, 1997);
(ii) all sequences aligned at levels of significant similarity are
used as new seeds U_i, and for each U_i SWISS-PROT is
searched again; (iii) this procedure is repeated until no new
sequences are found. Sequence space hopping may be used in
combination with knowledge from structures to widen families
(Holm and Sander, 1997), or to increase the information
contained in multiple sequence alignments input to prediction
methods (Rost, 1996, 1997). Recently, the transitivity of protein
families has been exploited successfully to automatically
increase the yield in database searches [Ruben Abagyan
presented the 'multi-link recognition' method in 1996 at the
CASP2 meeting (Abagyan and Batalov, 1997); Park et al.
(1997) presented the 'intermediate sequence search' method
and Neuwald et al. (1997) implemented the same concept
(Neuwald et al., 1997)]. Here, I confirm the original findings
based on a different data set, and analyse in detail how the
gain depended on the number of intermediate sequences and
their similarity.
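The three-step hopping procedure described above amounts to a breadth-first traversal over search hits. The sketch below is illustrative only: `search` stands in for any database search that returns the significantly similar sequences for a seed, and the toy hit table is invented for the example rather than taken from SWISS-PROT.

```python
from collections import deque


def hop_sequence_space(query, search):
    """Collect a family by repeatedly re-searching with every new hit.

    `search(seed)` is a hypothetical callable returning the identifiers of all
    sequences aligned to `seed` at a significant level of similarity.
    """
    family = {query}
    queue = deque([query])
    while queue:                      # step (iii): repeat until nothing new is found
        seed = queue.popleft()
        for hit in search(seed):      # steps (i) and (ii): align, then reuse hits as seeds
            if hit not in family:
                family.add(hit)
                queue.append(hit)
    return family


# Toy hit table (invented): U reaches D only through the intermediates A and C.
toy_hits = {
    "U": ["A", "B"],
    "A": ["U", "C"],
    "B": ["U"],
    "C": ["A", "D"],
    "D": ["C"],
    "X": ["Y"],
    "Y": ["X"],
}

family = hop_sequence_space("U", lambda seed: toy_hits.get(seed, []))
print(sorted(family))  # ['A', 'B', 'C', 'D', 'U']
```

In the toy table, D is never a direct hit of U and is found only through intermediate sequences, while the unrelated pair X and Y stays outside the family.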

Here, I present results of aligning a set of 792
sequence-unique (no pair in set has more than 25% sequence
identity) proteins of known structure against PDB. The following
questions were investigated. Is the number of protein pairs of
non-similar structures proportional to the distance from the
HSSP-curve (eqn 1), or do false positives increase more rapidly
in the twilight zone? Is the curve defined by Sander and
Schneider (1991) still valid? Would using sequence similarity
rather than identity improve accuracy (as speculated by
Schneider and Sander)? Finally, can the accuracy be improved for
pair alignments by expert rules? The results, based on a
1000-fold larger data set, partially verify earlier work (Sander
and Schneider, 1991). The novel aspects were (i) a definition of
a threshold for similarity (eqn 2), and a refinement of the
threshold for identity; (ii) an introduction of various expert
rules. Aspects largely complementing other analyses (Abagyan
and Batalov, 1997; Park et al., 1997; Brenner et al., 1998)
were: (i) a large-scale evaluation of exploiting intermediate
sequences (sequence-space-hopping); (ii) a detailed analysis
of true and false positives providing estimates for accuracy
and coverage of database searches; and (iii) a comparison with
BLAST, one of the most popular methods for rapid database
searches (Altschul et al., 1990; Altschul and Gish, 1996).

Methods 

Data set: 792 sequence-unique protein structures 
Protein databases are biased towards particular protein families.
To reduce this bias, analyses are usually restricted to
representative data sets (Hobohm et al., 1992). Here, I chose the
maximal set of sequence-unique proteins of known structure
available in early 1997 (Holm and Sander, 1996).
'Sequence-unique' was defined as 'no pair in the set falls above
the HSSP-curve' (eqn 1; Sander and Schneider, 1991). As a
rule-of-thumb, no pair had more than 25% pairwise sequence
identity. Each of these proteins was aligned against the subset
of PDB contained in the early 1997 release of the FSSP
database of protein structure alignments (Holm and Sander,
1996). This subset amounted in total to about 5646 protein
chains. Obviously the second step (792 versus 5646)
re-introduced bias into the results. However, aligning the 792
sequence-unique pairs against themselves would not have
yielded any result for most of the twilight zone analysed here.
Thus, 792 versus 5646 was the best compromise in reducing
bias and monitoring the biased region. The resulting test set
was the largest possible set of proteins for which structural
information was available (and thus false and correct hits
could be automatically distinguished).
Generation of sequence alignments 

Protein pairs were aligned by two different program types:
(i) full dynamic programming as implemented in the
Smith-Waterman (Smith and Waterman, 1981) based method
MaxHom (Schneider, 1994) (McLachlan metric, with
minimum = -0.5, maximum = 1.00, and gap open = 3, gap
elongation = 0.3); and (ii) quick database searches as
implemented by the two versions of the BLAST series: BLASTP
(Altschul et al., 1990; Altschul and Gish, 1996), and
PSI-BLAST (Altschul et al., 1997). All 792 unique proteins
were aligned against all 5646 proteins from the PDB subset.
Alignments shorter than 10 residues were not considered, as
identical polypeptides of up to 10 residues are known to occur
in different structure states (Kabsch and Sander, 1984; Cohen
et al., 1993). Technical limitations (CPU time) required the
restriction of the dynamic programming analysis to the best
2000 hits for each of the 792 unique proteins. (Note: this
restriction applied only to the final displayed alignment. Of
course, all possible combinations were explored initially by
the alignment algorithm.) The resulting final data set comprised
about 1.7 million pairwise alignments. For the comparison
between the dynamic programming and the BLAST methods,
the data set had to be reduced to all pairs that were aligned
by all methods compared (the problem was that neither
BLASTP, nor PSI-BLAST could be forced to report absolutely
wrong, i.e. ALL pairwise alignments).

Definition of sequence identity and sequence similarity 
(i) Pairwise sequence identity was defined by the percentage
of residues identical between two aligned sequences (e.g.
aspartic matching aspartic counted as 1: D - D = 1; aspartic on
glutamic was a non-match: D - E = 0). (ii) Pairwise sequence
similarity was defined by the percentage of residues similar
between two sequences (e.g. D - D ≤ 1; and aspartic on
glutamic was now considered a match: D - E > 0). Similarity
scores depend on the particular metric used to capture
physico-chemical properties of amino acids (note: most amino acids
are not considered 100% similar to themselves by typical
metrics, as such metrics are based on log-odds, e.g. for the
McLachlan metric only F, W, Y and C yield 100%
self-similarity). Consequently, levels of similarity are not directly
comparable between different metrics. For comparability, I
used the McLachlan metric (Gribskov et al., 1987) also used
in the HSSP database (Schneider et al., 1997). In principle,
there are two ways to convert similarity into percentage values:
(i) by normalizing the similarity score by the maximal possible
score observed in a given metric (percentage residue similarity);
and (ii) by setting an arbitrary threshold of the similarity score
to distinguish similar from not similar and counting the
percentage of residues that are similar according to this threshold
(percentage of similar residues). Again, I followed the practice of
the HSSP database, compiling the percentage residue similarity
(normalized by maximal possible scores). When compiling
percentages, the number of identical residues was normalized
by the number of residues aligned; gaps were ignored.
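
As a minimal sketch of the two percentages defined above (an illustration only, not the MaxHom or HSSP implementation), the following Python computes percentage identity and percentage residue similarity over the aligned, non-gap positions. The function names and the scoring function `toy_score` are hypothetical; its values are made-up stand-ins for the McLachlan metric, which is not reproduced here.

```python
def aligned_pairs(seq_a, seq_b, gap="-"):
    """Return residue pairs of an alignment, ignoring columns with a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]


def percent_identity(seq_a, seq_b):
    """Identical residues as a percentage of the aligned (non-gap) residues."""
    pairs = aligned_pairs(seq_a, seq_b)
    if not pairs:
        return 0.0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)


def percent_similarity(seq_a, seq_b, score, max_score):
    """Summed similarity score, normalized by the maximal possible score."""
    pairs = aligned_pairs(seq_a, seq_b)
    if not pairs:
        return 0.0
    total = sum(score(a, b) for a, b in pairs)
    return 100.0 * total / (max_score * len(pairs))


def toy_score(a, b):
    """Hypothetical scores (assumption): only F, W, Y, C are fully self-similar."""
    if a == b:
        return 1.0 if a in "FWYC" else 0.8
    if {a, b} == {"D", "E"}:
        return 0.5  # aspartic on glutamic counts as a partial match
    return 0.0


identity = percent_identity("DDF-", "DEFA")
similarity = percent_similarity("DDF-", "DEFA", toy_score, max_score=1.0)
```

With a metric of this kind, similarity is not guaranteed to exceed identity, because identical residues other than F, W, Y and C contribute less than the maximal score.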
Standard of truth for structural similarity 
Similarity between two protein structures is not uniquely
defined. Different structure alignment methods yield different
scores (Alexandrov et al., 1992; Holm et al., 1993; Luo et al.,
1993; Orengo, 1994; Crippen and Maiorov, 1995; Gerstein
and Levitt, 1996; Holm and Sander, 1996; Orengo and Taylor,
1996; Zu-Kang and Sippl, 1996). Such differences can be
substantial, as illustrated by differences between the
expert-based database of structural alignments SCOP (Murzin et al.,
1995; Brenner et al., 1996; Hubbard et al., 1997), and the
automatically generated databases CATH (Orengo et al., 1993,
1997) and FSSP (Holm and Sander, 1996). In general, FSSP
tends to find more pairs of similar structure than do CATH
and SCOP. However, this is only a trend. For many examples,
SCOP finds structural similarity and FSSP does not. Here, I
chose the FSSP database as 'standard of truth': any pair for
which FSSP listed a significant score [zDALI > 4 (Holm and
Sander, 1996)] of structural similarity was considered to be
structurally similar. In order to distinguish between true and
false positives, this decision implied that all pairs not listed at
the given cut-off of the FSSP database were structurally not
similar. However, this brought up the problem of different
structure alignment methods. For example, SCOP may consider
a pair structurally similar, and FSSP may not. Thus, additionally,
all pairs that were listed in FSSP but with lower z-scores were
excluded from the analysis. Even that still left pairs of
proteins with clear levels of sequence identity (more than
40%) which were not found listed in FSSP. Thus, I had
to refine this procedure by semi-automatically checking the
structural similarity for about 2000 protein pairs, all of which
had levels above 30% pairwise sequence identity [note this
number was negligibly small, as only 1% of all pairs were
found above this value (Fig. 2B)]. The particular way in which
the standard-of-truth was constructed implied that estimates for
true positives might be slightly optimistic, estimates for false
negatives slightly pessimistic.

Fig. 1. Sketch of sequence-space-hopping. The triangle defines three search
proteins (A, B and C) having mutually less than 25% sequence identity. The
circles define the three families (all sequences inside the circle indicated by
arbitrary names aaa_species have more than 25% sequence identity to the
respective search proteins A, B and C). Sequence-space-hopping implies
joining the circles representing the protein families (as shown for proteins A
and B in the striped circles) if they contain identical proteins that are
aligned in the same region (ab_cvb in the example given).

Concept of true and false hits 

When Chothia and Lesk (1986) first analysed the relation
between sequence and structure similarity, they monitored the
details of structural differences, and found that the differences
are inversely proportional to the level of sequence identity.
The binary notion of 'similar structure' (true or false) used in
this analysis reflected a different focus: the goal was to estimate
the accuracy in correctly detecting rather than in correctly
aligning homologues. Did this imply that correct detection and
correct alignment were not correlated (as is often the case for
threading: Bryant and Altschul, 1995; Lemer et al., 1995;
Sippl, 1995; Fischer et al., 1996)? Not necessarily, but the
fact is that two homologues can be detected although part or
even the entire alignment is wrong. (However, this extremely
irritating point was not pursued further in this analysis.)

Fig. 2. Explosion of structurally dissimilar pairs in the twilight zone.
Numbers of true positives (pairs with similar structure) and of false
positives (pairs with no similar structure) plotted versus the distance to
the HSSP-curve (Sander and Schneider, 1991), i.e. the horizontal axes give
the distance from the threshold defined in eqn 1 (numbers refer to the
parameter n in eqn 1). The levels of pairwise sequence identity
corresponding to the distance are shown on top. (A) Number of pairs
observed at any distance (logarithmic scale). (B) Cumulative number of
pairs observed (logarithmic scale). For example, at a threshold
corresponding to about 32% sequence identity for long alignments, the
numbers of true and false positives were equal (arrow in A); at about 29%
even the cumulative numbers of true and false positives were equal (arrow
in B). Note: numbers of true negatives and false negatives result from the
cumulative sums left of the threshold; percentages of true and false
positives are given in Figure 5.

Fig. 3. Pairwise sequence identity versus alignment length. The original
HSSP-curve (Sander and Schneider, 1991) (dotted circles, eqn 1) appeared
to fit the true positives (homologues, A) better than the false positives (B).
In contrast, the new curve proposed here (filled diamonds, eqn 2) was more
conservative in excluding false positives. Note that due to the huge number
of pairs the plots for true (A) and false (B) positives appeared almost
equally densely populated (Figure 2 revealed the problem of such a scatter
plot).

The following cases were distinguished: (i) true positives,
alignments between proteins of similar structure that fall above
a given threshold (defined by the sequence alignment method);
(ii) false positives, alignments between proteins of dissimilar
structure that fall above a given threshold of the sequence
alignment; (iii) true negatives, alignments between proteins of
dissimilar structure that fall below a given threshold; and
(iv) false negatives, alignments between proteins of similar
structure that fall below a given threshold. Note that 'negatives'
and 'positives' represent two sides of the same coin: at
any threshold extracted from the sequence alignment (n), the
following equations hold (for cumulative numbers):

false negatives + true positives = all pairs of similar structure

true negatives + false positives = all pairs of dissimilar structure.
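
The four cases and the two bookkeeping identities can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for the analysis: each pair is represented as a hypothetical `(distance, similar_structure)` tuple, and a pair lying exactly on the threshold is counted as 'above' (a boundary choice the text does not specify).

```python
from collections import Counter


def classify(distance, similar_structure, threshold):
    """Label one aligned pair as TP, FP, TN or FN at the given threshold."""
    above = distance >= threshold  # assumption: ties count as 'above'
    if above:
        return "TP" if similar_structure else "FP"
    return "FN" if similar_structure else "TN"


def count_hits(pairs, threshold):
    """Count the four cases over (distance, similar_structure) tuples."""
    counts = Counter({"TP": 0, "FP": 0, "TN": 0, "FN": 0})
    for distance, similar in pairs:
        counts[classify(distance, similar, threshold)] += 1
    return counts


# Toy data (made up): distance from the threshold curve, structural verdict.
pairs = [(5.0, True), (2.0, False), (-1.0, True), (-3.0, False), (0.0, False)]
counts = count_hits(pairs, threshold=0.0)

# The two identities hold at any threshold.
n_similar = sum(1 for _, similar in pairs if similar)
assert counts["FN"] + counts["TP"] == n_similar
assert counts["TN"] + counts["FP"] == len(pairs) - n_similar
```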
Distance to HSSP threshold 

The HSSP-curve was originally defined by (Sander and
Schneider, 1991):

$$p^{I}(n) = n + \begin{cases} 290.15 \cdot L^{-0.562}, & \text{for } L < 80 \\ 25, & \text{for } L \geq 80 \end{cases} \qquad (1)$$

where L gave the number of residues aligned between two
proteins; $p^{I}$ the cut-off percentage of identical residues over
the L aligned residues; and n described the distance in
percentage points from the curve (n = 0 corresponds to the
original HSSP-curve; n = 5 to the official HSSP database
releases; curve plotted in Figure 3). Once Sander and Schneider
(1991) had discovered the basic functional dependence between
sequence identity and alignment length, they merely had to
fix two free parameters: the factor and the exponent. Both
were chosen to fit the data observed in 1991, in particular to
reach values of 25% around alignment length of 80, and
values of 100% around alignment length of 10. The principal
functional dependence described by eqn 1 also follows
from statistics, as was recently shown in an elegant work
(Alexandrov and Soloveyev, 1998). Let $p_i$ (i = 1,..., 20) be the
probability that amino acid i occurs in a protein, and $m_{ij}$ the
score for randomly aligning two amino acids i and j. The score
S of an entire alignment can then be approximated by:

$$S = \langle m \rangle \cdot L$$

where $\langle m \rangle$ is the expectation value of $m_{ij}$, and L the
alignment length. If the values of $m_{ij}$ are independent, Gaussian
distributed variables, it follows (after some elementary operations)
that the relation between the standard deviation of the values of
$m_{ij}$ ($\sigma_m$), and the resulting score distribution ($\sigma_S$) is:

$$\sigma_S = \sigma_m \cdot \sqrt{L}$$

In their original article, Alexandrov and Soloveyev work
out an appropriate re-scaling of the dynamic programming
alignment. However, this scheme cannot be applied after the
alignment has been completed (as the threshold functions used
in this work); rather, it has to be implemented into the
alignment method.
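
Returning to eqn 1, the threshold and the distance n of a given alignment from the curve can be sketched as follows. This is a minimal illustration, assuming the two-branch form of eqn 1 as given in the text (power law below 80 aligned residues, constant 25% from 80 onwards); the function names are hypothetical.

```python
def hssp_cutoff(length, n=0.0):
    """Cut-off percentage identity of eqn 1 for `length` aligned residues."""
    if length <= 0:
        raise ValueError("alignment length must be positive")
    if length < 80:
        base = 290.15 * length ** -0.562
    else:
        base = 25.0
    return n + base


def hssp_distance(percent_identity, length):
    """Distance n (percentage points) of an alignment from the n = 0 curve."""
    return percent_identity - hssp_cutoff(length, n=0.0)


# An alignment of 120 residues at 35% identity lies 10 points above the curve.
distance = hssp_distance(35.0, 120)
```

A pair is then accepted at a chosen stringency n whenever `hssp_distance` is at least n, which is how the horizontal axes of Figure 2 are to be read.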

New curve for length-dependent significance of pairwise 
sequence identity 

I attempted to solve the problems of the original HSSP-curve
(eqn 1; Results) by defining the following curve for the
separation of true and false positives (Figure 3, grey line with
dotted circles):

$$p^{I}(n) = n + 480 \cdot L^{-0.32 \cdot (1 + e^{-L/1000})} \qquad (2)$$

where L gave the number of residues aligned between two
proteins; $p^{I}$ the cut-off percentage of identical residues over
the L aligned residues; and n described the distance in
percentage points from the curve (n = 0 plotted in Figure 3).
The constraints in visually selecting the final function were (i)
to maintain the functional form defined by eqn 1 (and suggested
by the statistics of Alexandrov and Soloveyev, 1998); (ii) to
hit the 100% mark at alignments that are too short to reveal
anything about structural similarity (≤11 residues); (iii) to
saturate at levels around 20% sequence identity (reached for
length = 300); and (iv) to roughly reflect the observed gradient.
Saturation for long alignments was realized by the functional
form of the exponent (note: the term $e^{-L/1000}$ resulted in an
exponential decay). This 'saturation' constraint also affected
the particular value of the factor (0.32 rather than about 0.5
as suggested by the distribution of the data, Figure 4).
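
The constraints listed for eqn 2 can be checked numerically with a short sketch. It assumes the reading of eqn 2 as n + 480 · L^(-0.32 · (1 + e^(-L/1000))); the decay constant of 1000 in particular is an assumption recovered from a damaged formula, and the function name is hypothetical.

```python
import math


def new_identity_cutoff(length, n=0.0):
    """Cut-off percentage identity of eqn 2 (form assumed as stated above)."""
    if length <= 0:
        raise ValueError("alignment length must be positive")
    exponent = -0.32 * (1.0 + math.exp(-length / 1000.0))
    return n + 480.0 * length ** exponent


# Constraint (ii): the curve crosses 100% between 11 and 12 residues.
short_cutoff = new_identity_cutoff(11)
# Constraint (iii): the curve is close to 20% at 300 aligned residues.
long_cutoff = new_identity_cutoff(300)
```

Under these assumptions the curve exceeds 100% at 11 residues, drops just below 100% at 12, and is close to 20% at 300 residues, which agrees with the constraints named in the text.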

New curve for length-dependent significance of pairwise 
sequence similarity 

The original HSSP-curve was derived for sequence identity,
not for sequence similarity (Sander and Schneider, 1991). The
functional dependence between similarity and length appeared
comparable to the one between identity and length (Results).
This prompted a similar definition for the separation between
true and false positives based on similarity:

$$p^{S}(n) = n + 420 \cdot L^{-0.335 \cdot (1 + e^{-L/2000})} \qquad (3)$$

where L gave the number of residues aligned between two
proteins; $p^{S}$ defined the cut-off for the percentage of residue
similarity over the L aligned residues; and n described the
distance in percentage points from the curve (n = 0 plotted
in Figure 4).

Fig. 4. Pairwise sequence similarity versus alignment length. (A) Correctly
detected structural homologies; (B) false positives. Open circles, original
HSSP-curve (Sander and Schneider, 1991) (eqn 1); filled triangles, new
curve proposed here (eqn 3).

Sequence-space-hopping 

Suppose proteins $A_0$ and $B_0$ were less than 25% identical;
family A is given by: {$A_0$, $A_1$,..., $A_n$} (such that all proteins in
the family are more than 25% identical to $A_0$); analogously
family B is given by: {$B_0$, $B_1$,..., $B_m$}. Although $A_0$ and $B_0$
differed by more than 75%, it may well be true that both were
aligned to the same sequences, i.e. that for some i and j:
$A_i = B_j$. If this is the case, 'sequence-space-hopping' refers to
simply extending both families A and B to become:
{$A_0$, $A_1$,..., $A_n$, $B_0$, $B_1$,..., $B_m$} (Figure 1). Technically, I
described this situation by compiling a simple matrix H(A,B) that
contained the number of overlapping proteins (i.e. those contained
both in family A and B) between all proteins in the test set (792
chains) and all proteins in the search set (5646 chains). For
example, H(A,B) = 5 implied that test protein A and search
protein B had five identical proteins in their family alignments.
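
The matrix H(A,B) amounts to counting the members shared by two families. The sketch below illustrates this with Python sets. It is a simplification under stated assumptions: families are plain sets of hypothetical sequence identifiers, and the additional requirement that shared proteins be aligned in the same region (Figure 1) is not modelled.

```python
def overlap_matrix(test_families, search_families):
    """H(A,B): number of proteins common to family A and family B."""
    return {
        (a, b): len(members_a & members_b)
        for a, members_a in test_families.items()
        for b, members_b in search_families.items()
    }


def hop(family_a, family_b, min_common=1):
    """Join two families if they share at least `min_common` proteins."""
    if len(family_a & family_b) >= min_common:
        return family_a | family_b
    return None


# Toy families with made-up identifiers; 'ab_cvb' is shared by A and B.
test_families = {"A": {"A0", "a1_xyz", "ab_cvb"}}
search_families = {
    "B": {"B0", "b1_xyz", "ab_cvb"},
    "C": {"C0", "c1_xyz"},
}

H = overlap_matrix(test_families, search_families)
joined = hop(test_families["A"], search_families["B"])
```

Raising `min_common` corresponds to the stricter hopping strategies, in which several proteins must be common to both families before a pair is accepted.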



The family alignments were taken from the HSSP database
(Schneider et al., 1997) with a cut-off at: HSSP-curve + 10%
(n = 10 in eqn 1), i.e. for alignments longer than 80 residues,
35% pairwise sequence identity was required. All protein pairs
(A,B) in the twilight zone were investigated for which H(A,B)
was larger than zero. Note that the concept of
sequence-space-hopping explored here is being used in everyday
sequence analysis. The novel idea introduced by others (Abagyan and
Batalov, 1997; Neuwald et al., 1997; Park et al., 1997) was
NOT sequence-space-hopping as such, but its use for reducing
false positives in large-scale sequence analysis. Here, I simply
applied this concept to the large data set explored,
and investigated its usefulness in dependence on various
parameters.

More-similar-than-identical rule

A simple rule-of-thumb was explored: accept hits only if the
level of sequence similarity was higher than the level of
sequence identity. This rule may appear to be non-selective in
that similarity would always be larger than identity; however,
for the given definition of similarity (using the McLachlan
metric), this was not the case.
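
Applied to a list of database hits, the rule is a one-line filter. The sketch below assumes each hit is a dictionary with hypothetical keys `identity` and `similarity`, holding the two percentages computed over the same alignment.

```python
def more_similar_than_identical(hits):
    """Keep only hits whose percentage similarity exceeds percentage identity."""
    return [hit for hit in hits if hit["similarity"] > hit["identity"]]


# Toy hits with made-up percentages.
hits = [
    {"name": "hit_1", "identity": 24.0, "similarity": 31.0},
    {"name": "hit_2", "identity": 27.0, "similarity": 22.0},
    {"name": "hit_3", "identity": 30.0, "similarity": 30.0},
]
accepted = more_similar_than_identical(hits)
```

The comparison is strict, so a hit with equal similarity and identity is rejected.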

Results 

Number of false positives exploded in twilight zone 
In contrast to 1990, when Sander and Schneider (1991) 
compiled their data, now protein pairs of dissimilar structure 
were detected above the 30% cut-off (Figure 2A). And these 
were not exceptions: at a level of 32% (HSSP-curve + 7%,
i.e. n = 7 in eqn 1), the number of false positives already
equalled that of homologues. For the original HSSP-curve the
number of false positives was 20-fold higher than the number
of true pairs. The transition from 20 to 30% sequence identity
was highly non-linear for true and false positives (logarithmic
scales in Figure 2): the number of true pairs rose by a factor
of 5, that of false pairs by a factor of 200 (Figure 2B). Thus,
below the region of significant pairwise sequence identity
(>34%) the population of false positives exploded. However,
the vast majority of homologues also had less than 30%
sequence identity.

Functional shape of original HSSP-curve adequate 
The functional shape of the original HSSP-curve proved to be
basically correct (Figure 3, grey line with triangles). However,
the larger data set analysed here revealed several problems in
detail (Figure 3B). (i) A threshold of 25% was not reasonable
for an alignment length below 150-200 residues. (ii) Above
an alignment length of about 100 residues, the derivative of
the curve separating true and false positives should be lower
than at lengths below 80. I attempted to solve these problems
by defining a new curve for separating true and false positives
(eqn 2; Figure 3, grey line with dotted circles). The particular
functional form guaranteed an approximate saturation for long
alignments. For alignments shorter than 11 residues eqn 2
yielded values above 100%. However, this was acceptable as
100% identity for fragments of 10-11 residues does not imply
structural similarity (Cerpa et al., 1996; Minor and Kim, 1996;
Muñoz and Serrano, 1996). The new curve saturated around
20% for alignments over more than 250 residues.

Defining a curve for pairwise sequence similarity 
Compiling sequence identity neglects the physico-chemical
nature of amino acids. Any multiple sequence alignment
illustrates that, for example, the feature hydrophobicity is more
conserved than is the residue type. For the million protein
pairs investigated here, this was reflected in a shift of the
scatter plot towards lower percentages (Figure 4). In particular,
for longer alignments false positives fall below 15% pairwise
sequence similarity. This prompted the introduction of a
threshold specifically for sequence similarity (eqn 3 in
Methods; Figure 4, grey line with dotted circles). The curve
surpassed 100% for alignments shorter than 12 residues and
saturated at about 10% for alignments over more than 500
residues.

[Fig. 5. ... for detecting homologies in the twilight zone. How to choose
the cut-off line for automatic database searches? (remainder of caption
not recoverable). Panels: true positives; false positives. Horizontal
axes: distance from threshold. Legend: original HSSP-curve; trivial
approach: identity independent of alignment length; new curve: sequence
identity; new curve: sequence similarity; rule of thumb: similarity >
identity.]

Better detection of homologues in twilight zone by new 
curves 

The new curves for length-dependent cut-offs in sequence
identity (eqn 2) and similarity (eqn 3) resulted in clearly lower
false positive rates (higher accuracy) than the original
HSSP-curve (Figure 5B and C). This was paid for by a lower number
of true positives detected (lower coverage; Figure 5A). At
n = 0 (eqn 1-3), the old curve yielded about twofold more
true positives, but more than 20-fold more false positives
compared to the new curves for identity and similarity.
Furthermore, at any level of true positives detected, the number of
false positives was smaller for the new curves (eqn 2-3) than
for the original HSSP-curve (eqn 1; Figure 7). When applying a
cut-off according to mere sequence identity (ignoring alignment
length), accuracy dropped below 10% at levels of 30% sequence
identity (Figure 5C). Thus, detection accuracy rose almost
10-fold by the new curves.

Improving detection accuracy by expert rule 
Experts often apply rules-of-thumb to visually distinguish true
and false positives. However, many such simple rules
appeared not to be valid for automatic implementation. In particular,
the distributions of the number and length of insertions did
not, on average, differ between false and true positives (data
not shown). Detection accuracy improved marginally by
applying the following rules: (i) compile the distance for the
similarity score $n^{S}$ (eqn 3), and the identity score $n^{I}$ (eqn 2),
average over both ([$n^{S}$ + $n^{I}$]/2), and accept pairs when this
average is above some threshold n; (ii) take pairs whenever
either identity or similarity surpassed the respective threshold
(either $n^{S}$ or $n^{I}$ > n); (iii) take pairs if both values were above
a given cut-off ($n^{S}$ and $n^{I}$ > n). In contrast, detection accuracy
increased significantly by applying the
'more-similar-than-identical' rule: accept hits found in a database
search only if percentage similarity is larger than percentage
identity. This constraint resulted in >98% detection accuracy at
n = 0 cut-off levels (eqn 2-3), while 2-4-fold fewer true positives
were found at this level (Figure 5A and C). Hence, applied as a
conservative cut-off in automatic database searches, this rule
proved rather powerful.

Fig. 6. Improving accuracy by sequence-space-hopping. Distances were
compiled according to the old curve (eqn 1, 'old'), and to the new curve for
identity (eqn 2, 'ide'). Corresponding levels of sequence identity shown on
top. The cumulative percentages of true positives detected at a given cut-off
distance were compiled for three different hopping strategies: hits were
accepted if, at least, one (H(A,B) = 1), five (H(A,B) = 5) or 10 (H(A,B) =
10) proteins were common between two protein families (Methods).
(A) Cumulative percentage of true positives (false positives = 100 - true);
(B) cumulative number of true positives. The comparison of the true
positives reached by intermediate sequences and all true positives (grey line
in B, note: same as in Figure 2) showed that (i) less than 1/1000 of the true
positives were reached by intermediate sequences; (ii) the number of pairs
reached by intermediate sequences did not explode in the twilight zone
(scale on the left covers two orders of magnitude, that on the right only
one). Numbers for true and false negatives would not make sense for this
analysis: as we don't know all proteins, we cannot conclude that two
families are unrelated only because we don't find a link between them.
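
The three combination rules (i)-(iii) of the section on expert rules can be written down compactly. The sketch below assumes that the distances from the identity curve (eqn 2) and from the similarity curve (eqn 3) have already been compiled for a pair; the function and parameter names are hypothetical.

```python
def accept_average(n_ide, n_sim, n):
    """Rule (i): the mean of both distances must exceed the threshold n."""
    return (n_ide + n_sim) / 2.0 > n


def accept_either(n_ide, n_sim, n):
    """Rule (ii): at least one of the two distances must exceed n."""
    return n_ide > n or n_sim > n


def accept_both(n_ide, n_sim, n):
    """Rule (iii): both distances must exceed n."""
    return n_ide > n and n_sim > n


# Toy pair: 4 points above the identity curve, 2 points below the
# similarity curve (made-up values).
n_ide, n_sim = 4.0, -2.0
decisions = (
    accept_average(n_ide, n_sim, 0.0),
    accept_either(n_ide, n_sim, 0.0),
    accept_both(n_ide, n_sim, 0.0),
)
```

Rule (iii) is the strictest and rule (ii) the most permissive, so any pair accepted by (iii) is also accepted by (i) and (ii).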

Improving detection accuracy by sequence-space-hopping 

Hopping in sequence space proved successful in discarding 
false positives. Already the minimal constraint to accept a pair 
if at least one protein was common between the two sequence 
families yielded levels of around 80% accuracy even down 
to cut-off levels corresponding to 20% sequence identity 
(Figure 6A, compared with <20% accuracy for the normal
thresholds, Figure 5C). Accuracy increased further when more
proteins were required to be common to both families
(Figure 6A). However, sequence-space-hopping was possible
for only relatively few protein pairs (Figure 6B). Furthermore,
the improvement in accuracy was less clear using
sequence-space-hopping than by applying the
'more-similar-than-identical' rule (Figure 5).



Accuracy versus coverage for BLAST and full dynamic
programming

The balance between accuracy (percentage of detected pairs that
are true) and coverage (percentage of all true pairs that are
detected) enables choosing automatic thresholds according to the
particular purpose of a database search. It also permits comparing
different methods (the higher the values, the better). (i) As
expected, the commonly used simple level of sequence identity
(disregarding alignment length) proved, again, an extremely bad
choice. (ii) Surprisingly, the fast database searching method BLAST
performed relatively well in comparison to the full dynamic
programming (Figure 7A). (iii) Both BLASTP version 2
and PSI-BLAST were almost as good as the full dynamic
programming with the previously defined HSSP-threshold
(Sander and Schneider, 1991). (iv) Best performance was
achieved by the new threshold for similarity (eqn 3).
(v) However, the raw alignment score performed almost as well.
(vi) BLASTP (Altschul et al., 1990) performed rather similarly
to the more elaborate and more recent PSI-BLAST (Altschul
et al., 1997) (and for 'high' accuracy even slightly better,
Figure 7A inset; note: given that standard parameters were
chosen, this was not surprising). The corresponding thresholds
were given in Figure 5B for the dynamic programming, and
in Figure 7B for the PSI-BLAST probabilities.
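
An accuracy versus coverage curve of this kind is obtained by sweeping a threshold over the scored pairs. The following Python is a minimal sketch under stated assumptions, not the evaluation code of this work: pairs are hypothetical `(score, similar_structure)` tuples in which a higher score means a more confident hit, and a pair exactly at the threshold counts as detected.

```python
def accuracy_coverage(pairs, threshold):
    """Return (accuracy, coverage) in percent at a given score threshold.

    accuracy = true positives / all pairs detected above the threshold
    coverage = true positives / all pairs of similar structure
    """
    tp = sum(1 for score, similar in pairs if score >= threshold and similar)
    fp = sum(1 for score, similar in pairs if score >= threshold and not similar)
    all_true = sum(1 for _, similar in pairs if similar)
    accuracy = 100.0 * tp / (tp + fp) if (tp + fp) else None  # undefined without hits
    coverage = 100.0 * tp / all_true if all_true else None
    return accuracy, coverage


# Toy data (made up): score of the alignment, structural verdict.
pairs = [(10, True), (8, True), (6, False), (4, True), (2, False), (1, False)]

# Sweeping the threshold traces out the accuracy versus coverage curve.
curve = [(t, accuracy_coverage(pairs, t)) for t in (9, 5, 0)]
```

Lowering the threshold raises coverage and, in this toy example, lowers accuracy, which is the trade-off discussed for Figures 5 and 7.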

Many false negatives at reasonable cut-off values 
The number of false negatives is often of interest, i.e. the 
number of proteins that belong to a structure family but were 
not detected above a given cut-off. For the data sets used here, 
the cumulative percentage of false negatives was extremely 
high for all reasonable cut-off levels (Figure 5D). The vast 
majority of all pairs of proteins with similar structure populate 
the midnight zone below 10% sequence identity (Rost, 1997). 
Thus, the extremely high false negative rates proved that 
methods aligning two proteins merely based on the pairwise 
levels of sequence homology clearly fail to find the gold mine 
of database searches (and that older analyses that failed to 
describe this effect were based on biased data sets). 

Thresholds for practical use 

For simplicity, the functions (eqn 1-3) were explicitly provided
in tables (Rost, 1998). At levels of n = 0 (eqn 1-3) the
cumulative percentages of true positives were (Figure 5):
HSSP-curve (eqn 1), 12%; new identity curve (eqn 2), 56%; new
similarity curve (eqn 3), 73%. In order to achieve levels of
99% correct hits, m percentage points have to be added to the
curves, where m was: HSSP-curve, m = 8; new identity curve,
m = 5; new similarity curve, m = 12. For comparison,
applying the 'more-similar-than-identical' rule yielded levels
above 99% down to m = -1.

Conclusions 

Rapid transition from trivial to needle-in-haystack problem 
The twilight zone of sequence pair alignments (20-35%
pairwise sequence identity) was characterized by two non-linear
transitions. (i) The number of homologues (true positives)
rose by a factor of about eight (Figure 2A). I obtained a 
similar result from analysing the first four entire genomes 
(Rost, 1997) which indicated that this result was general, rather 
than database dependent, (ii) The number of false positives 
rose by a factor of 5000 (Figure 2B). Hence, separating true 
and false positives switched from a trivial task (above 35%) 
to the problem of finding needles in a haystack (20-30%).

Fig. 7. Accuracy versus coverage for various methods and thresholds.
Accuracy was defined as the cumulative percentage of true positives (actual
true/all actual), coverage as the percentage of true positives that were
detected at a given threshold (actual true/all true). (A) Thresholds and
methods shown: Δidentity, new threshold for length-dependent sequence
identity (eqn 2); Δsimilarity, new threshold for length-dependent sequence
similarity (eqn 3); HSSP-curve, curve proposed by Sander and Schneider
(1991; eqn 1); %identity, threshold given by sequence identity alone, i.e.
disregarding alignment length; alignment score, score used for the dynamic
programming optimization MaxHom; blast, BLASTP version 2 (Altschul
and Gish, 1996); psi-blast, BLASTP version 3 (Altschul et al., 1997), run
with standard parameters. The values for the BLAST methods were based
on the probability scores reported by these algorithms. The BLAST methods
did not report all pairwise alignments, thus the data set had to be reduced to
the subset for which aligned pairs were reported by all three methods
(MaxHom, BLASTP2, BLASTP3). Note that whereas the curves for the
BLAST methods, as well as for identity and similarity, are likely to hold up
in general, the curve for the alignment score is valid only for the particular
implementation of the dynamic programming in MaxHom, and for the
particular choice of parameters (Methods). (B) Detail of the relation
between the BLAST probability (here for psi-blast), and the cumulative
number of true/false hits, as well as percentage accuracy and coverage.



The explosion of false positives shed light on the shape of
sequence space. From 100-35% sequence identity, any residue
exchange resulting in a stable structure maintains structure.
From 28-35% sequence identity, most residue exchanges
maintain structure. From 20-28% sequence identity, the
absolute majority of residue exchanges forming stable
structures populate different protein families. Is the explosion
caused by features of structure space? If one generates protein
sequences at random (or randomly superposes non-related
proteins), the counts for most of the region above 10%
sequence identity are negligible (Rost, 1997). Thus, although
it is obvious that we expect to find more pairs for lower levels
of sequence identity based on mere statistics, the particular
transition in the twilight zone seems not to be evident. However,
this analysis did not provide answers to whether or not the
observed explosion may reflect structural (Chung and Subbiah,
1996) and/or functional constraints.

Poor distinction between true and false positives by 
sequence identity, alone 

Even journals such as Cell, or EMBO provide an ample source 
for the following fallacy: 'these two fragments of 16 residues 
adopt similar structures as they have more than 10 similar
residues'. Thus, one of the most important messages of this
analysis might be the repetition of a point made by others
(Sander and Schneider, 1991): high levels of sequence similarity
or identity do not guarantee structural similarity (Figure 5).
Instead, the levels of significant sequence identity and similarity 
depend on the alignment length (Figures 3 and 4), or the 
respective raw score of the alignment methods. 

Better distinction by new curves for sequence identity and 
similarity 

The length-dependent cut-off for significant sequence identity 
pioneered by Sander and Schneider (1991) needed refinement 
m several ways to account for the findings from a 1000-fold 
larger data set: (i) shift towards higher values for shorter 
alignments; (ii) saturation for alignments longer than 150 
residues; (Hi) definition of new curve for levels of sequence 
similarity. These tasks were solved by introducing threshold 
curves for significant sequence identity (eqn 2), and for 
significant sequence similarity (eqn 3). The precise definition 
of the two thresholds was entirely empirical. However, the 
essential functional dependency of the curves was kept similar 
to what would be expected from pure statistical considerations. 
Although not true for all problems (Nielsen et al., 1996), cm 
average, sequence similarity was marginally more successful 
than identity in distinguishing true and false positives. The 
new curves improved accuracy at a given coverage (Figure 5 
and 7). Additionally, this analysis supplied detailed levels for 
expected accuracy and coverage for the curves defined, as 
well as for standard BLAST searches (Figures 5 and 7). 
Such estimates may have implications for automatic database 
searches. They also shed light on the comparison between 
sequence alignments and threading techniques that both only 
make use of pair comparisons (rather man using family specific 
profiles): already at levels of 25% sequence identity, pair 
alignments detect only 10-30% true positives. This is below 
the level of what threading techniques achieve in the interval 
0-25% sequence identity (Sippl, 1995; Fischer and Eisenberg, 
1996; Russell et al., 1996; Rost et al., 1997).
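
The two quantities traded off throughout this discussion can be made concrete with a minimal Python sketch. It assumes the usual definitions (accuracy = true positives among all pairs detected above a threshold; coverage = true positives among all structurally similar pairs); the function name and the input layout are hypothetical and not taken from the analysis itself.

```python
def accuracy_and_coverage(pairs, threshold):
    """Return (accuracy, coverage) for pairs scoring at or above `threshold`.

    `pairs` is a list of (score, is_true) tuples, where `score` could be a
    percentage identity, a percentage similarity or a raw alignment score,
    and `is_true` states whether the two proteins have similar structures.
    These definitions are an assumption made for illustration.
    """
    pairs = list(pairs)
    # Flags of all pairs that pass the threshold ("detected" pairs).
    detected = [is_true for score, is_true in pairs if score >= threshold]
    true_positives = sum(1 for is_true in detected if is_true)
    all_true = sum(1 for _, is_true in pairs if is_true)

    accuracy = true_positives / len(detected) if detected else 0.0
    coverage = true_positives / all_true if all_true else 0.0
    return accuracy, coverage
```

Raising the threshold in such a scheme typically increases accuracy while lowering coverage, which is the trade-off the curves are meant to balance.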

Improved accuracy by 'more-similar-than-identical' rule and 
sequence space hopping 

The number of false positives was significantly reduced by 
two techniques (only the first of which was novel to this 
work). (i) The 'more-similar-than-identical' rule: 95% of all
pairs for which percentage similarity was larger than percentage 
identity had similar structures. Thus, this constraint clearly 
improved detection accuracy. The cost was low coverage: for 
only 10% of the structurally similar pairs the percentage 
similarity was larger than percentage identity. This might be 
explained by the fact that half of the protein, on average,
embedded in loop regions, may tolerate residue exchanges that 
do not conserve physico-chemical properties (and thus decrease 
the overall average more than the few to-be-conserved regions
increase it). (ii) The usage of 'multi-links' (Abagyan and
Batalov, 1997), 'intermediate sequences' (Park et al., 1997),
'transitivity' (Neuwald et al., 1997), or 'sequence space
hopping': most protein pairs that contained a similar subset of 
identical proteins in their respective sequence families were 
found to have similar structures even at low levels of sequence 
homology. Obviously, the validity of transitivity (detection
accuracy) between protein families (Figure 1) depended on 
the distance between the families (Figure 6). Interestingly, the 
improvement of accuracy hardly depended on the number of 
proteins required to be common to two families. This suggested 
that although the vast majority of protein pairs with 25% 
sequence identity had dissimilar structures, the 'islands' popu-
lated by structure families were well separated. Unfortunately,
for the data set explored here, the yield of this analysis was 
found to be very low: on average only one in 1000 pairs was 
reached via intermediate sequences (Figure 6). Furthermore, 
sequence-space-hopping resulted in clearly lower coverage/ 
accuracy ratios than did the application of the 'more-similar-
than-identical' rule (Figures 5 and 6).
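
Both filters can be sketched in a few lines of Python. The sketch makes several loud assumptions: similarity is counted only over non-identical but conservative exchanges (otherwise similarity could never fall below identity), the residue groups below are a hypothetical stand-in for the McLachlan metric used in the analysis, and all function names are illustrative.

```python
# Hypothetical physico-chemical groups; NOT the McLachlan metric of the paper.
SIMILAR_GROUPS = ["AGST", "ILMV", "FWY", "DENQ", "HKR", "C", "P"]
GROUP_OF = {aa: i for i, group in enumerate(SIMILAR_GROUPS) for aa in group}


def percent_identity_and_similarity(seq_a, seq_b):
    """Return (% identity, % similarity) over the gap-free aligned positions.

    Assumption: 'similar' means a non-identical exchange within one group.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not aligned:
        return 0.0, 0.0
    identical = sum(1 for x, y in aligned if x == y)
    similar = sum(
        1
        for x, y in aligned
        if x != y and x in GROUP_OF and GROUP_OF[x] == GROUP_OF.get(y)
    )
    n = len(aligned)
    return 100.0 * identical / n, 100.0 * similar / n


def more_similar_than_identical(seq_a, seq_b):
    """Rule (i): accept the pair only if % similarity exceeds % identity."""
    identity, similarity = percent_identity_and_similarity(seq_a, seq_b)
    return similarity > identity


def linked_by_intermediates(family_a, family_b, min_shared=1):
    """Rule (ii): two sequence families are linked if they share members."""
    return len(set(family_a) & set(family_b)) >= min_shared
```

For instance, `more_similar_than_identical("ILKDEF", "VMRDQY")` holds under these groups because five of the six aligned positions are conservative exchanges and only one is identical.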

Beginning of the 90s: over-estimation of sequence alignment 
methods 

Until 1996, very few people had taken up the laborious
task of objective large-scale analyses of protein sequence
comparisons, partly because automatic structure comparison
methods are fairly recent. The few earlier workers (Sander
and Schneider, 1991; Vogt et al., 1995; Gotoh, 1996) based
their work on data sets of about 1000 pairs of protein structure 
alignments. Gotoh (1996) and Vogt et al. (1995) used the same
set (Pascarella and Argos, 1992) for testing different alignment
methods, and a variety of substitution metrics. They focused
on monitoring the detailed accuracy in terms of number of 
residues correctly aligned. Due to the small data set, Vogt et al.
(1995) found about 98% true positives at 30% sequence 
identity (ignoring alignment length), and 50% true positives 
at 20% sequence identity. For the 1000-fold larger data 
set used here the corresponding values were quite different 
(ignoring alignment length): 11% true positives at 30% 
sequence identity, and 5% true positives at 20% identity. 
However, even the more conservative analysis introducing 
the importance of alignment length for levels of significant 
sequence identity (Sander and Schneider, 1991) still over- 
estimated the possible levels of sequence identity between 
proteins of dissimilar structure. 

End of the 90s: database searches do not reach the 
gold mine, yet 

The thresholds for sequence identity and similarity defined
here, as well as those established by others (Abagyan and
Batalov, 1997; Brenner et al., 1998) complemented the levels
for 'significance' provided by BLAST (Altschul and Gish,
1996), FASTA (Pearson, 1996) or other statistical analyses
(Bryant and Altschul, 1995) by addressing the question 'how 
significant is the significance of the respective alignment 
method?'. Based on quite different data sets the principal 
messages were similar: (i) most proteins of similar structure
were not found by pairwise sequence comparisons at reasonable 
cut-off thresholds; (ii) raw scores from dynamic programming 
methods were comparable to the original length-dependent 
cutoff thresholds for sequence identity (Sander and Schneider, 
1991); (iii) dynamic programming was only slightly superior 
to BLAST searches (Altschul and Gish, 1996; Altschul et al.,
1997). However, in detail the numbers differed between the
recent analyses. Obviously, the absolute values depended 
crucially on the particular choice of the data set. Abagyan and 
Batalov (1997) analysed various substitution metrics on a
data set comparable to the one used in this analysis. They 
concluded that raw alignment scores provide better separations 
between true and false positives than do length-dependent 
cut-offs for sequence identity and similarity. The difference 
between their result, and the one shown here may result from 
the fact that Abagyan and Batalov (1997) used the optimal 
choice of all parameters for comparing the raw alignment 
score to sequence identity and similarity. Brenner and co- 
workers have analysed the accuracy and coverage for various 
statistical scores (Brenner et al., 1998). They used a completely
different data set than I did. An approximate comparison of
the two analyses was possible by the reference point of 
simple identity (ignoring alignment length). It seems that the 
performance for the best separation method they find (new 
FASTA) was comparable to the improved, simple thresholds 
defined here (eqn 2-3). Here, the BLAST probability was 
found to be a relatively good way to separate true and false 
positives (Figure 7A): it was only slightly inferior to the raw 
dynamic programming alignment score, results for which hold 
up exclusively for the particular choice of parameters and the 
particular alignment algorithm used.

Thresholds in practice 

The advantage of the length-dependent levels of identity and
similarity (eqn 2-3) over other thresholds (Abagyan and
Batalov, 1997; Alexandrov and Soloveyev, 1998) was that
these thresholds, in principle, are applicable to any alignment, 
and may relate more explicitly to structure. Identity and 
similarity can be compiled easily without having to re-do the 
entire database search. In practice, this does not always hold 
up: (i) different parameters (e.g. the way in which gaps are 
treated) may result in different alignments; and (ii) the similar- 
ity values compiled hold for the choice of a particular metric 
(here McLachlan). Additionally, the thresholds introduced here
provided independent evidence for the separation, and permitted
the application of the successful 'more-similar-than-identical'
rule.

Will the analysis hold up for the next 500 structures? 
The results given here were based on the largest possible data
set for which structural alignments provided a well-defined 
distinction between true and false. One conclusion was that 
seven years ago (Sander and Schneider, 1991) the database 
was too small to capture the details. Will this also be true in 
2005? Answers have to remain speculative. (i) Although the
database used in 1990 was 1000-fold smaller than the one
used here, some principal findings were verified. (ii) Assuming
that there are only 1000 folds in nature (Chothia, 1992), and
that these correspond to about 10 000 families, then even the 
full catalogue of all protein sequences would yield a data set 
essentially only 30 times larger than the one used here (note: 
the data set used corresponded to about 300 different folds 
aligned against about 1000 families). 
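
The factor of about 30 is not derived explicitly. One plausible reading, offered here only as an assumption, compares the number of fold-by-family combinations in a complete catalogue with the number covered by the present data set:

```latex
\[
\frac{1000~\text{folds} \times 10\,000~\text{families}}
     {300~\text{folds} \times 1000~\text{families}}
  = \frac{10^{7}}{3 \times 10^{5}} \approx 33
\]
```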

Rather more accurate, or more sensitive? 
An accurate and sensitive distinction between true and false 
positives is important for automatic database searches. The 
new curves introduced here (eqn 2-3) proved slightly more 
sensitive (higher coverage) and more accurate than the previ- 
ously proposed curve (Sander and Schneider, 1991). The 
accuracy increased significantly by applying the 'more-similar-
than-identical' rule, and by sequence space hopping. However,
accuracy was gained at the expense of coverage. Which is more 
important? Clearly, the evolutionary information contained in 
multiple alignments is the single most important contribution 
to improving protein structure prediction in the 90s (Rost and
Sander, 1996; Rost and O'Donoghue, 1997). Is the gain by
increased diversity more important than the loss of accuracy 
when using alignments for structure prediction? The answer 
depends on the particular prediction goal. For example, for 
secondary structure prediction diversity is more important than 
accuracy (cut-off at 25% versus that at 30%), whereas for 
the prediction of solvent accessibility the opposite is true 
(unpublished). Furthermore, as databases grow, coverage may
be less important than accuracy. Irrespective of individual 
preferences, the sharper the knife cutting between true and 
false positives, the better. This analysis has sharpened the 
knife a little, and added new optional tools to it.
EXHIBIT 7



CLUSTAL W (1.8) multiple sequence alignment

Sequences (rows of each complete block appear in this order):

gi|4406210|gb|AAD19901.1|
gi|2661771|emb|CAA74140.1|
gi|4416322|gb|AAD20314.1|
gi|12642370|gb|AAK00231.1|AF22
9306aa.txt

MAEFEERPRLSIGEEEAPPYPLEKLTGGRRRTRAQIHALH
MNEITTSEQLDLQ
WKIDNRELSLPTFCGCAQPSTIHFLYFNKIQMAEQRRNTRSTAKSKVQPV

-MEGPIPPHSLEAEQSVLGSILLDSDWIDEVEGLLPSPEAFYAEAHRKIY
QQAGRVPPQAVELEQAVLGAMLIEPEAIPRALEILT-PEAFYDGRHQRIF
LFSERIPPQSIEAEQAVLGAVFLDPAALVPASEILI-PEDFYRAAHQKIF
TAALKVPPHSIEAEQAVLGGLMLDNNAWERVLDQVS-DGDFYRHDHRLIF
NDYGRIQPQAPELEEAVLGALMIEKDAYSLVSEILR-PESFYEHRHQLIY

AAMQALRSQGRPVDLVTLSEELSRRGQLEEVGGTAYLLQLSEATPTAAYA
RAIVRLFEQNRGVDLLTVTEELRRTGELEQAGDTIYLSELTTRVASAANV
HAMLRVADRGEPVDLVTVTAELAASEQLEEIGGVSYLSELADAVPTAANV
RAVHKLADANQPFDVVTLHEQLDKEGLSSQVGGLAYLAELAKNTPSVANI
AAITDLAVNQKPVDILTVKEQLSKRGELEEVGGPFYITQLSSKVASSAHI

EHYARIVAEKWTLRRLIQAAGEAMRLAYEEAG-SLDEILDTAGKKILEVA
EYHARIIAEKALLRRMIEVMTLLVGRAYDPAA-DAFELLDEVEAEIFRLS
EYYARIVEEKSVLRRLIRTATSIAQDGYTRED-EIDVLLDEADRKIMEVS
KAYAAIIRERATLRQLISISTDIADNAFNPQGRNAAEILDDAERQIFQIA
EYHARIIAQKYLARELITFTSNIQSKAFDETL-DVDDLMQEAEGKLFEIS

LTKTDTEARP-MRELVHETFEHIEALFQNKGEVAGVRTGFKELDQLIGTL
DVHLRKAARS-MNEWKETLERLEAIHGRPGGITGVPSGFHQLDALTGGW
QRKHSGAFKN-IKDILVQTYDNIEMLHNRDGEITGIPTGFTELDRMTSGF
EARPKTGGPVGVNELLTMAIDRIDTLFNSDSDITGISTGYTDLDEKTSGL
QRNMKKDYTQ-INPIIAEAYEQIQKAAARTDGLSGLESGYTKLDKMTSGW

GPGSLNIIAARPAMGKTAFALTIAQNAALK--EGVGVGIYSLEMPAAQLT
QRGDLIIIAARPSMGKTAFALSCARNAALHPHYGTGVAIFSLEMGAEQLA
QRSDLIIVAARPSVGKTAFALNIAQNVATK--TNENVAIFSLEMSAQQLV
QAADLIIVAGRPSMGKTTFAMNLVENAVLR--TDKAVLVFSLEMPGESLI
QKSDLIIIAARPAMGKTAFVLSMAKNIAVN--FRNPVALFSLEMSNVQLV

LRMMCSEARIDMNRVRLGQLTDRDFSRLVDVASRLSEAPIYIDDTPDLTL
QRLLTAEARVDAQAARTGRLRDEDWRKLARAAGRLSDAPIFIDDTPSLGV
MRMLCAEGNINAQNLRTGKLTPEDWGKLTMAMGSLSNAGIYIDDTPSIRV
MRMLSSLGRIDQTKVRSGQLDDDDWPRLTSAVNLLNDRKLFIDDTAGISP
NRLISNVCEIPSEKIKSGQLAAYEWQQLDYKLKDLLDAPLYVDDTPSLSV

MEVRARARRLVSQN-QVGLIIIDYLQLMSGPGSGKSGENRQQEIAAISRG
LELRAKCRRLKAEH-DIGLVIVDYLQLMQASHMPRN-ANREQEIAQISRS
SDIRAKCRRLKQES-GLGMIVIDYLQLIQGSGRSKE--NRQQEVSEISRS
SEMRARTRRLAREHGEIAMIMVDYLQLMQIPGSAGD--NRTNEISEISRS
FELRTKARRLVREH-GVRIIIIDYLQLMNASGMAFG--SRQEEVSTISRS

LKALARELGIPIIALSQLSRAVEARP---NKRPMLSDLRES
LKALAKELNVPWALSQLSRAVETRGG--DKRPQLSDLRESGCLAGDTLI
LKALARELEVPVIALSQLSRSVEQRQ---DKRPMMSDIRES
LKALAKEFNCPVIALSQLNRSLEQRP---NKRPVNSDLRES
LKGLAKELNIPIIALSQLNRGVESREGLEGKRPQLSDLRES
**.**:*: *::*****.*.:** .*** **.***

TLADGRRVPIRELVSQQNFSVWALNPQTYRLERARVSRAFCTGIKPVYRL
G
-G-
-G-

TTRLGRSIRATANHRFLTPQGWKRVDELQPGDYLALPRRIPTASTPTLTE

AELALLGHLIGDGCTLPHHVIQYTSRDADLATLVAHLATKVFGSKVTPQI

RKELRWYQVYLRAARPLAPGKRNPISDWLRDLGIFGLRSYEKKVPALLFC

QTSEAIATFLRHLWATDGCIQMRRGKKPYPAVYYATSSYQLARDVQSLLL

RLGINARLKTVAQGEKGRVQYHVKVSGREDLLRFVEKIGAVGARQRAALA

SVYDYLSVRTGNPNRDIIPVALWYELVREAMYQRGISHRQLHANLGMAYG

GMTLFRQNLSRARALRLAEAAACPELRQLAQSDVYWDPIVSIEPDGVEEV

SIEQDADLVMFIYRDEYYNPHSEKAG
FDLTVPGPHNFVANDIIAHNSIEQDADWLFIYRPERYGITVDENGNPTE
SIEQDADIVAFLYRDDYYNKDSENKN
AIEQDADVIMFVYRDEVYHPETEHKG
AIEQDADMVCFIHRPEYYKIFQDDKGNDLR

-IAEIIVGKQRNGPTGTVELQFHASHVRFND    LARD
GIAEIIIGKQRNGPTGTVRIAFINQYARFEN    LTMY
-IIEIIIAKQRNGPVGTVQLAFIKEYNKFVN    LERR
-VAEIIIGKQRNGPIGFVRLAFIGKYTRFEN    LAPG
GMAEIIIAKHRNGAVGDVLLRFKGEYTRFQNPDDDMVIPLPDAGAMLGSR

A
QPEPGTPLPETPDETILPSGPPDEAPF
FDEAQIPPGA
MYNFDDDE
MNNTGTVPPPPAEFAPQNSNPFGGENDGPLPF



